pandas

pandas - a powerful data analysis and manipulation library for Python

pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with “relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing practical, real world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful and flexible open source data analysis / manipulation tool available in any language. It is already well on its way toward this goal.

Main Features

Here are just a few of the things that pandas does well:

  • Easy handling of missing data in floating point as well as non-floating point data.

  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects.

  • Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations.

  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data.

  • Easy conversion of ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects.

  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets.

  • Intuitive merging and joining data sets.

  • Flexible reshaping and pivoting of data sets.

  • Hierarchical labeling of axes (possible to have multiple labels per tick).

  • Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving/loading data from the ultrafast HDF5 format.

  • Time series-specific functionality: date range generation and frequency conversion, moving window statistics, date shifting and lagging.

class pandas.ArrowDtype[source]

An ExtensionDtype for PyArrow data types.

Warning

ArrowDtype is considered experimental. The implementation and parts of the API may change without warning.

While most dtype arguments can accept the “string” constructor, e.g. "int64[pyarrow]", ArrowDtype is useful if the data type contains parameters like pyarrow.timestamp.

Parameters:

pyarrow_dtype (pa.DataType) – An instance of a pyarrow.DataType.

pyarrow_dtype

An instance of a pyarrow.DataType.

Type:

pa.DataType

Examples

>>> import pyarrow as pa
>>> pd.ArrowDtype(pa.int64())
int64[pyarrow]

Types with parameters must be constructed with ArrowDtype.

>>> pd.ArrowDtype(pa.timestamp("s", tz="America/New_York"))
timestamp[s, tz=America/New_York][pyarrow]
>>> pd.ArrowDtype(pa.list_(pa.int64()))
list<item: int64>[pyarrow]
property type

Returns associated scalar type.

property name: str

A string identifying the data type.

numpy_dtype

Return an instance of the related numpy dtype.

kind

itemsize

Return the number of bytes in this dtype.

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

classmethod construct_from_string(string)[source]

Construct this type from a string.

Parameters:

string (str) – string should follow the format f"{pyarrow_type}[pyarrow]", e.g. int64[pyarrow]

Return type:

ArrowDtype
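
Example (an illustrative sketch, not part of the official docstring; assumes pyarrow is installed):

>>> pd.ArrowDtype.construct_from_string("int64[pyarrow]")
int64[pyarrow]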

class pandas.BooleanDtype[source]

Extension dtype for boolean data.

Warning

BooleanDtype is considered experimental. The implementation and parts of the API may change without warning.


Examples

>>> pd.BooleanDtype()
BooleanDtype
name: str = 'boolean'
property type: type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

property kind: str

A character code (one of ‘biufcmMOSUV’), default ‘O’

This should match the NumPy dtype used when the array is converted to an ndarray, which is probably ‘O’ for object if the extension type cannot be represented as a built-in NumPy type.

See also

numpy.dtype.kind

property numpy_dtype: dtype

Return an instance of our numpy dtype
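
Example (illustrative, not from the official docstring):

>>> pd.BooleanDtype().numpy_dtype
dtype('bool')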

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type
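
Example (illustrative, not from the official docstring; the exact module path of the returned class is a pandas internal and may vary by version):

>>> pd.BooleanDtype.construct_array_type()
<class 'pandas.core.arrays.boolean.BooleanArray'>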

class pandas.Categorical[source]

Represent a categorical variable in classic R / S-plus fashion.

Categoricals can only take on a limited, and usually fixed, number of possible values (categories). In contrast to statistical categorical variables, a Categorical might have an order, but numerical operations (additions, divisions, …) are not possible.

All values of the Categorical are either in categories or np.nan. Assigning values outside of categories will raise a ValueError. Order is defined by the order of the categories, not lexical order of the values.

Parameters:
  • values (list-like) – The values of the categorical. If categories are given, values not in categories will be replaced with NaN.

  • categories (Index-like (unique), optional) – The unique categories for this categorical. If not given, the categories are assumed to be the unique values of values (sorted, if possible, otherwise in the order in which they appear).

  • ordered (bool, default False) – Whether or not this categorical is treated as an ordered categorical. If True, the resulting categorical will be ordered. An ordered categorical respects, when sorted, the order of its categories attribute (which in turn is the categories argument, if provided).

  • dtype (CategoricalDtype) – An instance of CategoricalDtype to use for this categorical.

  • fastpath (bool) –

  • copy (bool) –

categories

The categories of this categorical

Type:

Index

codes

The codes (integer positions, which point to the categories) of this categorical, read only.

Type:

ndarray

ordered

Whether or not this Categorical is ordered.

Type:

bool

dtype

The instance of CategoricalDtype storing the categories and ordered.

Type:

CategoricalDtype

from_codes()
__array__()

Raises:
  • ValueError – If the categories do not validate.

  • TypeError – If an explicit ordered=True is given but no categories and the values are not sortable.

See also

CategoricalDtype

Type for categorical data.

CategoricalIndex

An Index with an underlying Categorical.

Notes

See the user guide for more.

Examples

>>> pd.Categorical([1, 2, 3, 1, 2, 3])
[1, 2, 3, 1, 2, 3]
Categories (3, int64): [1, 2, 3]
>>> pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

Missing values are not included as a category.

>>> c = pd.Categorical([1, 2, 3, 1, 2, 3, np.nan])
>>> c
[1, 2, 3, 1, 2, 3, NaN]
Categories (3, int64): [1, 2, 3]

However, their presence is indicated in the codes attribute by code -1.

>>> c.codes
array([ 0,  1,  2,  0,  1,  2, -1], dtype=int8)

Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max value.

>>> c = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
...                    categories=['c', 'b', 'a'])
>>> c
['a', 'b', 'c', 'a', 'b', 'c']
Categories (3, object): ['c' < 'b' < 'a']
>>> c.min()
'c'
property dtype: CategoricalDtype

The CategoricalDtype for this instance.

to_list()[source]

Alias for tolist.
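
Example (illustrative, not from the official docstring):

>>> pd.Categorical(['a', 'b', 'a']).to_list()
['a', 'b', 'a']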

classmethod from_codes(codes, categories=None, ordered=None, dtype=None)[source]

Make a Categorical type from codes and categories or dtype.

This constructor is useful if you already have codes and categories/dtype and so do not need the (computation intensive) factorization step, which is usually done on the constructor.

If your data does not follow this convention, please use the normal constructor.

Parameters:
  • codes (array-like of int) – An integer array, where each integer points to a category in categories or dtype.categories, or else is -1 for NaN.

  • categories (index-like, optional) – The categories for the categorical. Items need to be unique. If the categories are not given here, then they must be provided in dtype.

  • ordered (bool, optional) – Whether or not this categorical is treated as an ordered categorical. If not given here or in dtype, the resulting categorical will be unordered.

  • dtype (CategoricalDtype or "category", optional) – If CategoricalDtype, cannot be used together with categories or ordered.

Return type:

Categorical

Examples

>>> dtype = pd.CategoricalDtype(['a', 'b'], ordered=True)
>>> pd.Categorical.from_codes(codes=[0, 1, 0, 1], dtype=dtype)
['a', 'b', 'a', 'b']
Categories (2, object): ['a' < 'b']
property categories: Index

The categories of this categorical.

Setting assigns new values to each category (effectively a rename of each individual category).

The assigned value has to be a list-like object. All items must be unique and the number of items in the new categories must be the same as the number of items in the old categories.

Raises:

ValueError – If the new categories do not validate as categories or if the number of new categories is unequal to the number of old categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

property ordered: bool | None

Whether the categories have an ordered relationship.

property codes: ndarray

The category codes of this categorical.

Codes are an array of integers which are the positions of the actual values in the categories array.

There is no setter, use the other categorical methods and the normal item setter to change values in the categorical.

Returns:

A non-writable view of the codes array.

Return type:

ndarray[int]

set_ordered(value)[source]

Set the ordered attribute to the boolean value.

Parameters:

value (bool) – Set whether this categorical is ordered (True) or not (False).

Return type:

Categorical
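
Example (illustrative, not from the official docstring):

>>> c = pd.Categorical(['a', 'b'], ordered=False)
>>> c.set_ordered(True)
['a', 'b']
Categories (2, object): ['a' < 'b']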

as_ordered()[source]

Set the Categorical to be ordered.

Returns:

Ordered Categorical.

Return type:

Categorical

as_unordered()[source]

Set the Categorical to be unordered.

Returns:

Unordered Categorical.

Return type:

Categorical
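
Example covering both as_ordered and as_unordered (illustrative, not from the official docstring):

>>> c = pd.Categorical(['a', 'b'])
>>> c.as_ordered().ordered
True
>>> c.as_ordered().as_unordered().ordered
False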

set_categories(new_categories, ordered=None, rename=False)[source]

Set the categories to the specified new_categories.

new_categories can include new categories (which will result in unused categories) or remove old categories (which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more items than in the old categories will result in values set to NaN or in unused categories, respectively).

This method can be used to perform more than one action of adding, removing, and reordering simultaneously and is therefore faster than performing the individual steps via the more specialised methods.

On the other hand, this method does not run any checks (e.g., whether the old categories are included in the new categories on a reorder), which can result in surprising changes, for example when using special string dtypes, which do not consider an S1 string equal to a single-character Python string.

Parameters:
  • new_categories (Index-like) – The categories in new order.

  • ordered (bool, default False) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.

  • rename (bool, default False) – Whether or not the new_categories should be considered as a rename of the old categories or as reordered categories.

Return type:

Categorical with reordered categories.

Raises:

ValueError – If new_categories does not validate as categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.
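
Examples

An illustrative sketch (not from the official docstring) of reordering and dropping categories in one call:

>>> c = pd.Categorical(['a', 'b', 'c'])
>>> c.set_categories(['c', 'b', 'a'])
['a', 'b', 'c']
Categories (3, object): ['c', 'b', 'a']
>>> c.set_categories(['a', 'b'])
['a', 'b', NaN]
Categories (2, object): ['a', 'b']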

rename_categories(new_categories)[source]

Rename categories.

Parameters:

new_categories (list-like, dict-like or callable) –

New categories which will replace old categories.

  • list-like: all items must be unique and the number of items in the new categories must match the existing number of categories.

  • dict-like: specifies a mapping from old categories to new. Categories not contained in the mapping are passed through and extra categories in the mapping are ignored.

  • callable : a callable that is called on all items in the old categories and whose return values comprise the new categories.

Returns:

Categorical with renamed categories.

Return type:

Categorical

Raises:

ValueError – If new categories are list-like and do not have the same number of items as the current categories or do not validate as categories

See also

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'a', 'b'])
>>> c.rename_categories([0, 1])
[0, 0, 1]
Categories (2, int64): [0, 1]

For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed through

>>> c.rename_categories({'a': 'A', 'c': 'C'})
['A', 'A', 'b']
Categories (2, object): ['A', 'b']

You may also provide a callable to create the new categories

>>> c.rename_categories(lambda x: x.upper())
['A', 'A', 'B']
Categories (2, object): ['A', 'B']
reorder_categories(new_categories, ordered=None)[source]

Reorder categories as specified in new_categories.

new_categories need to include all old categories and no new category items.

Parameters:
  • new_categories (Index-like) – The categories in new order.

  • ordered (bool, optional) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.

Returns:

Categorical with reordered categories.

Return type:

Categorical

Raises:

ValueError – If the new categories do not contain all old category items or any new ones

See also

rename_categories

Rename categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.
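
Examples

An illustrative sketch (not from the official docstring):

>>> c = pd.Categorical(['a', 'b', 'c'])
>>> c.reorder_categories(['c', 'b', 'a'], ordered=True)
['a', 'b', 'c']
Categories (3, object): ['c' < 'b' < 'a']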

add_categories(new_categories)[source]

Add new categories.

new_categories will be included at the last/highest place in the categories and will be unused directly after this call.

Parameters:

new_categories (category or list-like of category) – The new categories to be included.

Returns:

Categorical with new categories added.

Return type:

Categorical

Raises:

ValueError – If the new categories include old categories or do not validate as categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['c', 'b', 'c'])
>>> c
['c', 'b', 'c']
Categories (2, object): ['b', 'c']
>>> c.add_categories(['d', 'a'])
['c', 'b', 'c']
Categories (4, object): ['b', 'c', 'd', 'a']
remove_categories(removals)[source]

Remove the specified categories.

removals must be included in the old categories. Values which were in the removed categories will be set to NaN.

Parameters:

removals (category or list of categories) – The categories which should be removed.

Returns:

Categorical with removed categories.

Return type:

Categorical

Raises:

ValueError – If the removals are not contained in the categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c.remove_categories(['d', 'a'])
[NaN, 'c', 'b', 'c', NaN]
Categories (2, object): ['b', 'c']
remove_unused_categories()[source]

Remove categories which are not used.

Returns:

Categorical with unused categories dropped.

Return type:

Categorical

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c[2] = 'a'
>>> c[4] = 'c'
>>> c
['a', 'c', 'a', 'c', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c.remove_unused_categories()
['a', 'c', 'a', 'c', 'c']
Categories (2, object): ['a', 'c']
map(mapper)[source]

Map categories using an input mapping or function.

Maps the categories to new categories. If the mapping correspondence is one-to-one the result is a Categorical which has the same order property as the original, otherwise an Index is returned. NaN values are unaffected.

If a dict or Series is used any unmapped category is mapped to NaN. Note that if this happens an Index will be returned.

Parameters:

mapper (function, dict, or Series) – Mapping correspondence.

Returns:

Mapped categorical.

Return type:

pandas.Categorical or pandas.Index

See also

CategoricalIndex.map

Apply a mapping correspondence on a CategoricalIndex.

Index.map

Apply a mapping correspondence on an Index.

Series.map

Apply a mapping correspondence on a Series.

Series.apply

Apply more complex functions on a Series.

Examples

>>> cat = pd.Categorical(['a', 'b', 'c'])
>>> cat
['a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> cat.map(lambda x: x.upper())
['A', 'B', 'C']
Categories (3, object): ['A', 'B', 'C']
>>> cat.map({'a': 'first', 'b': 'second', 'c': 'third'})
['first', 'second', 'third']
Categories (3, object): ['first', 'second', 'third']

If the mapping is one-to-one the ordering of the categories is preserved:

>>> cat = pd.Categorical(['a', 'b', 'c'], ordered=True)
>>> cat
['a', 'b', 'c']
Categories (3, object): ['a' < 'b' < 'c']
>>> cat.map({'a': 3, 'b': 2, 'c': 1})
[3, 2, 1]
Categories (3, int64): [3 < 2 < 1]

If the mapping is not one-to-one an Index is returned:

>>> cat.map({'a': 'first', 'b': 'second', 'c': 'first'})
Index(['first', 'second', 'first'], dtype='object')

If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:

>>> cat.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')
property nbytes: int

The number of bytes needed to store this object in memory.

memory_usage(deep=False)[source]

Memory usage of my values

Parameters:

deep (bool) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption

Return type:

bytes used

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False

See also

numpy.ndarray.nbytes

isna()[source]

Detect missing values

Missing values (-1 in .codes) are detected.

Return type:

np.ndarray[bool] of whether my values are null

See also

isna

Top-level isna.

isnull

Alias of isna.

Categorical.notna

Boolean inverse of Categorical.isna.
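
Example (illustrative, not from the official docstring):

>>> c = pd.Categorical(['a', np.nan, 'b'])
>>> c.isna()
array([False,  True, False])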

isnull()

Detect missing values

Missing values (-1 in .codes) are detected.

Return type:

np.ndarray[bool] of whether my values are null

See also

isna

Top-level isna.

isnull

Alias of isna.

Categorical.notna

Boolean inverse of Categorical.isna.

notna()[source]

Inverse of isna

Both missing values (-1 in .codes) and NA as a category are detected as null.

Return type:

np.ndarray[bool] of whether my values are not null

See also

notna

Top-level notna.

notnull

Alias of notna.

Categorical.isna

Boolean inverse of Categorical.notna.

notnull()

Inverse of isna

Both missing values (-1 in .codes) and NA as a category are detected as null.

Return type:

np.ndarray[bool] of whether my values are not null

See also

notna

Top-level notna.

notnull

Alias of notna.

Categorical.isna

Boolean inverse of Categorical.notna.

value_counts(dropna=True)[source]

Return a Series containing counts of each category.

Every category will have an entry, even those with a count of 0.

Parameters:

dropna (bool, default True) – Don’t include counts of NaN.

Returns:

counts

Return type:

Series

See also

Series.value_counts
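
Example (illustrative, not from the official docstring; to_dict is used here to sidestep Series repr details):

>>> c = pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c'])
>>> c.value_counts().to_dict()
{'a': 2, 'b': 1, 'c': 0}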

check_for_ordered(op)[source]

Assert that we are ordered.

Return type:

None

argsort(*, ascending=True, kind='quicksort', **kwargs)[source]

Return the indices that would sort the Categorical.

Missing values are sorted at the end.

Parameters:
  • ascending (bool, default True) – Whether the indices should result in an ascending or descending sort.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, optional) – Sorting algorithm.

  • **kwargs – passed through to numpy.argsort().

Return type:

np.ndarray[np.intp]

See also

numpy.ndarray.argsort

Notes

While an ordering is applied to the category values, arg-sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.

Examples

>>> pd.Categorical(['b', 'b', 'a', 'c']).argsort()
array([2, 0, 1, 3])
>>> cat = pd.Categorical(['b', 'b', 'a', 'c'],
...                      categories=['c', 'b', 'a'],
...                      ordered=True)
>>> cat.argsort()
array([3, 0, 1, 2])

Missing values are placed at the end

>>> cat = pd.Categorical([2, None, 1])
>>> cat.argsort()
array([2, 0, 1])
sort_values(*, inplace: Literal[False] = False, ascending: bool = True, na_position: str = 'last') → Categorical[source]
sort_values(*, inplace: Literal[True], ascending: bool = True, na_position: str = 'last') → None

Sort the Categorical by category value returning a new Categorical by default.

While an ordering is applied to the category values, sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.

Parameters:
  • inplace (bool, default False) – Do operation in place.

  • ascending (bool, default True) – Order ascending. Passing False orders descending. The ordering parameter provides the method by which the category values are organized.

  • na_position ({'first', 'last'} (optional, default='last')) – ‘first’ puts NaNs at the beginning; ‘last’ puts NaNs at the end.

Return type:

Categorical or None

See also

Categorical.sort, Series.sort_values

Examples

>>> c = pd.Categorical([1, 2, 2, 1, 5])
>>> c
[1, 2, 2, 1, 5]
Categories (3, int64): [1, 2, 5]
>>> c.sort_values()
[1, 1, 2, 2, 5]
Categories (3, int64): [1, 2, 5]
>>> c.sort_values(ascending=False)
[5, 2, 2, 1, 1]
Categories (3, int64): [1, 2, 5]

‘sort_values’ behaviour with NaNs. Note that ‘na_position’ is independent of the ‘ascending’ parameter:

>>> c = pd.Categorical([np.nan, 2, 2, np.nan, 5])
>>> c
[NaN, 2, 2, NaN, 5]
Categories (2, int64): [2, 5]
>>> c.sort_values()
[2, 2, 5, NaN, NaN]
Categories (2, int64): [2, 5]
>>> c.sort_values(ascending=False)
[5, 2, 2, NaN, NaN]
Categories (2, int64): [2, 5]
>>> c.sort_values(na_position='first')
[NaN, NaN, 2, 2, 5]
Categories (2, int64): [2, 5]
>>> c.sort_values(ascending=False, na_position='first')
[NaN, NaN, 5, 2, 2]
Categories (2, int64): [2, 5]
min(*, skipna=True, **kwargs)[source]

The minimum value of the object.

Only ordered Categoricals have a minimum!

Raises:

TypeError – If the Categorical is not ordered.

Returns:

min

Return type:

the minimum of this Categorical, NA value if empty

Parameters:

skipna (bool) –

max(*, skipna=True, **kwargs)[source]

The maximum value of the object.

Only ordered Categoricals have a maximum!

Raises:

TypeError – If the Categorical is not ordered.

Returns:

max

Return type:

the maximum of this Categorical, NA if array is empty

Parameters:

skipna (bool) –
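
Example covering both min and max (illustrative, not from the official docstring; the Categorical must be ordered):

>>> c = pd.Categorical(['a', 'b', 'c'], ordered=True)
>>> c.min(), c.max()
('a', 'c')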

unique()[source]

Return the Categorical whose categories and codes are unique.

Changed in version 1.3.0: Previously, unused categories were dropped from the new categories.

Return type:

Categorical

See also

pandas.unique, CategoricalIndex.unique

Series.unique

Return unique values of Series object.

Examples

>>> pd.Categorical(list("baabc")).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Categorical(list("baab"), categories=list("abc"), ordered=True).unique()
['b', 'a']
Categories (3, object): ['a' < 'b' < 'c']
equals(other)[source]

Returns True if categorical arrays are equal.

Parameters:

other (Categorical) –

Return type:

bool
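
Example (illustrative, not from the official docstring):

>>> c = pd.Categorical(['a', 'b', 'a'])
>>> c.equals(pd.Categorical(['a', 'b', 'a']))
True
>>> c.equals(pd.Categorical(['a', 'b', 'b']))
False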

describe()[source]

Describe this Categorical.

Returns:

description – A DataFrame with frequency and counts by category.

Return type:

DataFrame

isin(values)[source]

Check whether values are contained in Categorical.

Return a boolean NumPy Array showing whether each element in the Categorical matches an element in the passed sequence of values exactly.

Parameters:

values (set or list-like) – The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Return type:

np.ndarray[bool]

Raises:

TypeError

  • If values is not a set or list-like

See also

pandas.Series.isin

Equivalent method on Series.

Examples

>>> s = pd.Categorical(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'])
>>> s.isin(['cow', 'lama'])
array([ True,  True,  True, False,  True, False])

Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])
array([ True, False,  True, False,  True, False])
class pandas.CategoricalDtype[source]

Type for categorical data with the categories and orderedness.

Parameters:
  • categories (sequence, optional) – Must be unique, and must not contain any nulls. The categories are stored in an Index, and if an index is provided the dtype of that index will be used.

  • ordered (bool or None, default False) – Whether or not this categorical is treated as an ordered categorical. None can be used to maintain the ordered value of existing categoricals when used in operations that combine categoricals, e.g. astype, and will resolve to False if there is no existing ordered to maintain.

categories
ordered

See also

Categorical

Represent a categorical variable in classic R / S-plus fashion.

Notes

This class is useful for specifying the type of a Categorical independent of the values. See the user guide section on CategoricalDtype for more.

Examples

>>> t = pd.CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> pd.Series(['a', 'b', 'a', 'c'], dtype=t)
0      a
1      b
2      a
3    NaN
dtype: category
Categories (2, object): ['b' < 'a']

An empty CategoricalDtype with a specific dtype can be created by providing an empty index, as follows:

>>> pd.CategoricalDtype(pd.DatetimeIndex([])).categories.dtype
dtype('<M8[ns]')
name = 'category'
type

alias of CategoricalDtypeType

kind: str = 'O'
str: str = '|O08'
base: dtype | ExtensionDtype | None = dtype('O')
classmethod construct_from_string(string)[source]

Construct a CategoricalDtype from a string.

Parameters:

string (str) – Must be the string “category” in order to be successfully constructed.

Returns:

Instance of the dtype.

Return type:

CategoricalDtype

Raises:

TypeError – If a CategoricalDtype cannot be constructed from the input.
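
Example (illustrative, not from the official docstring; any string other than "category" raises the TypeError described above):

>>> dtype = pd.CategoricalDtype.construct_from_string("category")
>>> dtype.name
'category'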

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

static validate_ordered(ordered)[source]

Validates that we have a valid ordered parameter. If it is not a boolean, a TypeError will be raised.

Parameters:

ordered (object) – The parameter to be verified.

Raises:

TypeError – If ‘ordered’ is not a boolean.

Return type:

None

static validate_categories(categories, fastpath=False)[source]

Validates that we have good categories

Parameters:
  • categories (array-like) –

  • fastpath (bool) – Whether to skip nan and uniqueness checks

Returns:

categories

Return type:

Index

update_dtype(dtype)[source]

Returns a CategoricalDtype with categories and ordered taken from dtype if specified, otherwise falling back to self if unspecified.

Parameters:

dtype (CategoricalDtype) –

Returns:

new_dtype

Return type:

CategoricalDtype
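
Example (illustrative, not from the official docstring):

>>> base = pd.CategoricalDtype(['a', 'b'])
>>> updated = base.update_dtype(pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
>>> list(updated.categories), updated.ordered
(['a', 'b', 'c'], True)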

property categories: Index

An Index containing the unique categories allowed.

property ordered: bool | None

Whether the categories have an ordered relationship.

class pandas.CategoricalIndex[source]

Index based on an underlying Categorical.

CategoricalIndex, like Categorical, can only take on a limited, and usually fixed, number of possible values (categories). Also, like Categorical, it might have an order, but numerical operations (additions, divisions, …) are not possible.

Parameters:
  • data (array-like (1-dimensional)) – The values of the categorical. If categories are given, values not in categories will be replaced with NaN.

  • categories (index-like, optional) – The categories for the categorical. Items need to be unique. If the categories are not given here (and also not in dtype), they will be inferred from the data.

  • ordered (bool, optional) – Whether or not this categorical is treated as an ordered categorical. If not given here or in dtype, the resulting categorical will be unordered.

  • dtype (CategoricalDtype or "category", optional) – If CategoricalDtype, cannot be used together with categories or ordered.

  • copy (bool, default False) – Make a copy of input ndarray.

  • name (object, optional) – Name to be stored in the index.

Return type:

CategoricalIndex

codes
Type:

np.ndarray

categories
Type:

Index

ordered
Type:

bool | None

rename_categories()
reorder_categories()
add_categories()
remove_categories()
remove_unused_categories()
set_categories()
as_ordered()
as_unordered()
map()

Raises:
  • ValueError – If the categories do not validate.

  • TypeError – If an explicit ordered=True is given but no categories and the values are not sortable.

See also

Index

The base pandas Index type.

Categorical

A categorical array.

CategoricalDtype

Type for categorical data.

Notes

See the user guide for more.

Examples

>>> pd.CategoricalIndex(["a", "b", "c", "a", "b", "c"])
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

CategoricalIndex can also be instantiated from a Categorical:

>>> c = pd.Categorical(["a", "b", "c", "a", "b", "c"])
>>> pd.CategoricalIndex(c)
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'],
                 categories=['a', 'b', 'c'], ordered=False, dtype='category')

Ordered CategoricalIndex can have a min and max value.

>>> ci = pd.CategoricalIndex(
...     ["a", "b", "c", "a", "b", "c"], ordered=True, categories=["c", "b", "a"]
... )
>>> ci
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'],
                 categories=['c', 'b', 'a'], ordered=True, dtype='category')
>>> ci.min()
'c'
property codes

The category codes of this categorical.

Codes are an array of integers which are the positions of the actual values in the categories array.

There is no setter, use the other categorical methods and the normal item setter to change values in the categorical.

Returns:

A non-writable view of the codes array.

Return type:

ndarray[int]

property categories

The categories of this categorical.

Setting assigns new values to each category (effectively a rename of each individual category).

The assigned value has to be a list-like object. All items must be unique and the number of items in the new categories must be the same as the number of items in the old categories.

Raises:

ValueError – If the new categories do not validate as categories or if the number of new categories is unequal to the number of old categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

property ordered

Whether the categories have an ordered relationship.

equals(other)[source]

Determine if two CategoricalIndex objects contain the same elements.

Returns:

True if the two CategoricalIndex objects contain the same elements, otherwise False.

Return type:

bool

Parameters:

other (object) –
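
Example (illustrative, not from the official docstring):

>>> ci = pd.CategoricalIndex(['a', 'b', 'a'])
>>> ci.equals(pd.CategoricalIndex(['a', 'b', 'a']))
True
>>> ci.equals(pd.CategoricalIndex(['a', 'b', 'b']))
False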

property inferred_type: str

Return a string of the type inferred from the values.

reindex(target, method=None, level=None, limit=None, tolerance=None)[source]

Create an index with target’s values (move/add/delete values as necessary).

Returns:

  • new_index (pd.Index) – Resulting index

  • indexer (np.ndarray[np.intp] or None) – Indices of output values in original index

Return type:

tuple[Index, npt.NDArray[np.intp] | None]

map(mapper)[source]

Map values using an input mapping or function.

Maps the values (their categories, not the codes) of the index to new categories. If the mapping correspondence is one-to-one the result is a CategoricalIndex which has the same order property as the original, otherwise an Index is returned.

If a dict or Series is used any unmapped category is mapped to NaN. Note that if this happens an Index will be returned.

Parameters:

mapper (function, dict, or Series) – Mapping correspondence.

Returns:

Mapped index.

Return type:

pandas.CategoricalIndex or pandas.Index

See also

Index.map

Apply a mapping correspondence on an Index.

Series.map

Apply a mapping correspondence on a Series.

Series.apply

Apply more complex functions on a Series.

Examples

>>> idx = pd.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'],
                  ordered=False, dtype='category')
>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'], categories=['A', 'B', 'C'],
                 ordered=False, dtype='category')
>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first',
                 'second', 'third'], ordered=False, dtype='category')

If the mapping is one-to-one the ordering of the categories is preserved:

>>> idx = pd.CategoricalIndex(['a', 'b', 'c'], ordered=True)
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'],
                 ordered=True, dtype='category')
>>> idx.map({'a': 3, 'b': 2, 'c': 1})
CategoricalIndex([3, 2, 1], categories=[3, 2, 1], ordered=True,
                 dtype='category')

If the mapping is not one-to-one an Index is returned:

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'first'})
Index(['first', 'second', 'first'], dtype='object')

If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:

>>> idx.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')
add_categories(*args, **kwargs)

Add new categories.

new_categories will be included at the last/highest place in the categories and will be unused directly after this call.

Parameters:

new_categories (category or list-like of category) – The new categories to be included.

Returns:

Categorical with new categories added.

Return type:

Categorical

Raises:

ValueError – If the new categories include old categories or do not validate as categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['c', 'b', 'c'])
>>> c
['c', 'b', 'c']
Categories (2, object): ['b', 'c']
>>> c.add_categories(['d', 'a'])
['c', 'b', 'c']
Categories (4, object): ['b', 'c', 'd', 'a']
argsort(*args, **kwargs)

Return the indices that would sort the Categorical.

Missing values are sorted at the end.

Parameters:
  • ascending (bool, default True) – Whether the indices should result in an ascending or descending sort.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, optional) – Sorting algorithm.

  • **kwargs – passed through to numpy.argsort().

Return type:

np.ndarray[np.intp]

See also

numpy.ndarray.argsort

Notes

While an ordering is applied to the category values, arg-sorting in this context refers more to organizing and grouping together based on matching category values. Thus, this function can be called on an unordered Categorical instance unlike the functions ‘Categorical.min’ and ‘Categorical.max’.

Examples

>>> pd.Categorical(['b', 'b', 'a', 'c']).argsort()
array([2, 0, 1, 3])
>>> cat = pd.Categorical(['b', 'b', 'a', 'c'],
...                      categories=['c', 'b', 'a'],
...                      ordered=True)
>>> cat.argsort()
array([3, 0, 1, 2])

Missing values are placed at the end

>>> cat = pd.Categorical([2, None, 1])
>>> cat.argsort()
array([2, 0, 1])
as_ordered(*args, **kwargs)

Set the Categorical to be ordered.

Returns:

Ordered Categorical.

Return type:

Categorical

as_unordered(*args, **kwargs)

Set the Categorical to be unordered.

Returns:

Unordered Categorical.

Return type:

Categorical

max(*args, **kwargs)

The maximum value of the object.

Only ordered Categoricals have a maximum!

Raises:

TypeError – If the Categorical is not ordered.

Returns:

max

Return type:

the maximum of this Categorical, NA if array is empty

min(*args, **kwargs)

The minimum value of the object.

Only ordered Categoricals have a minimum!

Raises:

TypeError – If the Categorical is not ordered.

Returns:

min

Return type:

the minimum of this Categorical, NA value if empty

remove_categories(*args, **kwargs)

Remove the specified categories.

removals must be included in the old categories. Values which were in the removed categories will be set to NaN.

Parameters:

removals (category or list of categories) – The categories which should be removed.

Returns:

Categorical with removed categories.

Return type:

Categorical

Raises:

ValueError – If the removals are not contained in the categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c.remove_categories(['d', 'a'])
[NaN, 'c', 'b', 'c', NaN]
Categories (2, object): ['b', 'c']
remove_unused_categories(*args, **kwargs)

Remove categories which are not used.

Returns:

Categorical with unused categories dropped.

Return type:

Categorical

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'c', 'b', 'c', 'd'])
>>> c
['a', 'c', 'b', 'c', 'd']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c[2] = 'a'
>>> c[4] = 'c'
>>> c
['a', 'c', 'a', 'c', 'c']
Categories (4, object): ['a', 'b', 'c', 'd']
>>> c.remove_unused_categories()
['a', 'c', 'a', 'c', 'c']
Categories (2, object): ['a', 'c']
rename_categories(*args, **kwargs)

Rename categories.

Parameters:

new_categories (list-like, dict-like or callable) –

New categories which will replace old categories.

  • list-like: all items must be unique and the number of items in the new categories must match the existing number of categories.

  • dict-like: specifies a mapping from old categories to new. Categories not contained in the mapping are passed through and extra categories in the mapping are ignored.

  • callable : a callable that is called on all items in the old categories and whose return values comprise the new categories.

Returns:

Categorical with renamed categories.

Return type:

Categorical

Raises:

ValueError – If new categories are list-like and do not have the same number of items as the current categories or do not validate as categories

See also

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

Examples

>>> c = pd.Categorical(['a', 'a', 'b'])
>>> c.rename_categories([0, 1])
[0, 0, 1]
Categories (2, int64): [0, 1]

For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed through

>>> c.rename_categories({'a': 'A', 'c': 'C'})
['A', 'A', 'b']
Categories (2, object): ['A', 'b']

You may also provide a callable to create the new categories

>>> c.rename_categories(lambda x: x.upper())
['A', 'A', 'B']
Categories (2, object): ['A', 'B']
reorder_categories(*args, **kwargs)

Reorder categories as specified in new_categories.

new_categories need to include all old categories and no new category items.

Parameters:
  • new_categories (Index-like) – The categories in new order.

  • ordered (bool, optional) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.

Returns:

Categorical with reordered categories.

Return type:

Categorical

Raises:

ValueError – If the new categories do not contain all old category items or any new ones

See also

rename_categories

Rename categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

set_categories

Set the categories to the specified ones.

searchsorted(*args, **kwargs)

Find indices where elements should be inserted to maintain order.

Find the indices into a sorted array self (a) such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.

Assuming that self is sorted:

side     returned index i satisfies
left     self[i-1] < value <= self[i]
right    self[i-1] <= value < self[i]

Parameters:
  • value (array-like, list or scalar) – Value(s) to insert into self.

  • side ({'left', 'right'}, optional) – If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).

  • sorter (1-D array-like, optional) – Optional array of integer indices that sort array a into ascending order. They are typically the result of argsort.

Returns:

If value is array-like, array of insertion points. If value is scalar, a single integer.

Return type:

array of ints or int

See also

numpy.searchsorted

Similar method from NumPy.
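
Example (illustrative, not from the official docstring; an ordered CategoricalIndex is used so the values are sorted in category order, and int() is applied only to normalize the scalar repr):

>>> ci = pd.CategoricalIndex(['a', 'b', 'c'], ordered=True)
>>> int(ci.searchsorted('b'))
1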

set_categories(*args, **kwargs)

Set the categories to the specified new_categories.

new_categories can include new categories (which will result in unused categories) or remove old categories (which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more items than in the old categories will result in values set to NaN or in unused categories, respectively).

This method can be used to perform more than one action of adding, removing, and reordering simultaneously and is therefore faster than performing the individual steps via the more specialised methods.

On the other hand, this method does not run any checks (e.g., whether the old categories are included in the new categories on a reorder), which can result in surprising changes, for example when using special string dtypes, which do not consider an S1 string equal to a single-character Python string.

Parameters:
  • new_categories (Index-like) – The categories in new order.

  • ordered (bool, default False) – Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.

  • rename (bool, default False) – Whether or not the new_categories should be considered as a rename of the old categories or as reordered categories.

Return type:

Categorical with reordered categories.

Raises:

ValueError – If new_categories does not validate as categories

See also

rename_categories

Rename categories.

reorder_categories

Reorder categories.

add_categories

Add new categories.

remove_categories

Remove the specified categories.

remove_unused_categories

Remove categories which are not used.

tolist(*args, **kwargs)

Return a list of the values.

These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period)

Return type:

list
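
Example (illustrative, not from the official docstring):

>>> pd.CategoricalIndex(['a', 'b']).tolist()
['a', 'b']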

class pandas.DataFrame[source]

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns). Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary pandas data structure.

Parameters:
  • data (ndarray (structured or homogeneous), Iterable, dict, or DataFrame) –

    Dict can contain Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order. If a dict contains Series which have an index defined, it is aligned by its index. This alignment also occurs if data is a Series or a DataFrame itself. Alignment is done on Series/DataFrame inputs.

    If data is a list of dicts, column order follows insertion-order.

  • index (Index or array-like) – Index to use for resulting frame. Will default to RangeIndex if no indexing information part of input data and no index provided.

  • columns (Index or array-like) – Column labels to use for resulting frame when data does not have them, defaulting to RangeIndex(0, 1, 2, …, n). If data contains column labels, will perform column selection instead.

  • dtype (dtype, default None) – Data type to force. Only a single dtype is allowed. If None, infer.

  • copy (bool or None, default None) –

    Copy data from inputs. For dict data, the default of None behaves like copy=True. For DataFrame or 2d ndarray input, the default of None behaves like copy=False. If data is a dict containing one or more Series (possibly of different dtypes), copy=False will ensure that these inputs are not copied.

    Changed in version 1.3.0.

See also

DataFrame.from_records

Constructor from tuples, also record arrays.

DataFrame.from_dict

From dicts of Series, arrays, or dicts.

read_csv

Read a comma-separated values (csv) file into DataFrame.

read_table

Read general delimited file into DataFrame.

read_clipboard

Read text from clipboard into DataFrame.

Notes

Please reference the User Guide for more information.

Examples

Constructing DataFrame from a dictionary.

>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df
   col1  col2
0     1     3
1     2     4

Notice that the inferred dtype is int64.

>>> df.dtypes
col1    int64
col2    int64
dtype: object

To enforce a single dtype:

>>> df = pd.DataFrame(data=d, dtype=np.int8)
>>> df.dtypes
col1    int8
col2    int8
dtype: object

Constructing DataFrame from a dictionary including Series:

>>> d = {'col1': [0, 1, 2, 3], 'col2': pd.Series([2, 3], index=[2, 3])}
>>> pd.DataFrame(data=d, index=[0, 1, 2, 3])
   col1  col2
0     0   NaN
1     1   NaN
2     2   2.0
3     3   3.0

Constructing DataFrame from numpy ndarray:

>>> df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
...                    columns=['a', 'b', 'c'])
>>> df2
   a  b  c
0  1  2  3
1  4  5  6
2  7  8  9

Constructing DataFrame from a numpy ndarray that has labeled columns:

>>> data = np.array([(1, 2, 3), (4, 5, 6), (7, 8, 9)],
...                 dtype=[("a", "i4"), ("b", "i4"), ("c", "i4")])
>>> df3 = pd.DataFrame(data, columns=['c', 'a'])
...
>>> df3
   c  a
0  3  1
1  6  4
2  9  7

Constructing DataFrame from dataclass:

>>> from dataclasses import make_dataclass
>>> Point = make_dataclass("Point", [("x", int), ("y", int)])
>>> pd.DataFrame([Point(0, 0), Point(0, 3), Point(2, 3)])
   x  y
0  0  0
1  0  3
2  2  3

Constructing DataFrame from Series/DataFrame:

>>> ser = pd.Series([1, 2, 3], index=["a", "b", "c"])
>>> df = pd.DataFrame(data=ser, index=["a", "c"])
>>> df
   0
a  1
c  3
>>> df1 = pd.DataFrame([1, 2, 3], index=["a", "b", "c"], columns=["x"])
>>> df2 = pd.DataFrame(data=df1, index=["a", "c"])
>>> df2
   x
a  1
c  3
property axes: list[pandas.core.indexes.base.Index]

Return a list representing the axes of the DataFrame.

It has the row axis labels and column axis labels as the only members. They are returned in that order.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.axes
[RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'],
dtype='object')]
property shape: tuple[int, int]

Return a tuple representing the dimensionality of the DataFrame.

See also

ndarray.shape

Tuple of array dimensions.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.shape
(2, 2)
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4],
...                    'col3': [5, 6]})
>>> df.shape
(2, 3)
to_string(buf: None = None, columns: Sequence[str] | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) → str[source]
to_string(buf: FilePath | WriteBuffer[str], columns: Sequence[str] | None = None, col_space: int | list[int] | dict[Hashable, int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: fmt.FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool = False, decimal: str = '.', line_width: int | None = None, min_rows: int | None = None, max_colwidth: int | None = None, encoding: str | None = None) → None

Render a DataFrame to a console-friendly tabular output.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.

  • col_space (int, list or dict of int, optional) – The minimum width of each column. If a list of ints is given, every integer corresponds to one column. If a dict is given, the key references the column, while the value defines the space to use.

  • header (bool or sequence of str, optional) – Write out the column names. If a list of strings is given, it is assumed to be aliases for the column names.

  • index (bool, optional, default True) – Whether to print index (row) labels.

  • na_rep (str, optional, default 'NaN') – String representation of NaN to use.

  • formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

  • float_format (one-parameter function, optional, default None) –

    Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

    Changed in version 1.2.0.

  • sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

  • index_names (bool, optional, default True) – Prints the names of the indexes.

  • justify (str, default None) –

    How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

    • left

    • right

    • center

    • justify

    • justify-all

    • start

    • end

    • inherit

    • match-parent

    • initial

    • unset.

  • max_rows (int, optional) – Maximum number of rows to display in the console.

  • max_cols (int, optional) – Maximum number of columns to display in the console.

  • show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).

  • decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.

  • line_width (int, optional) – Width to wrap a line in characters.

  • min_rows (int, optional) – The number of rows to display in the console in a truncated repr (when number of rows is above max_rows).

  • max_colwidth (int, optional) – Max width to truncate each column in characters. By default, no limit.

  • encoding (str, default "utf-8") – Set character encoding.

Returns:

If buf is None, returns the result as a string. Otherwise returns None.

Return type:

str or None

See also

to_html

Convert DataFrame to HTML.

Examples

>>> d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
>>> df = pd.DataFrame(d)
>>> print(df.to_string())
   col1  col2
0     1     4
1     2     5
2     3     6
property style: Styler

Returns a Styler object.

Contains methods for building a styled HTML representation of the DataFrame.

See also

io.formats.style.Styler

Helps style a DataFrame or Series according to the data with HTML and CSS.
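
A minimal usage sketch (not from the official docstring; Styler requires the optional jinja2 dependency, and Styler.to_html assumes pandas >= 1.3):

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> styler = df.style.highlight_max()  # build up styling on the Styler
>>> html = styler.to_html()  # render the styled table as an HTML string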

items()[source]

Iterate over (column name, Series) pairs.

Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.

Yields:
  • label (object) – The column names for the DataFrame being iterated over.

  • content (Series) – The column entries belonging to each label, as a Series.

Return type:

Iterable[tuple[Hashable, pandas.core.series.Series]]

See also

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.itertuples

Iterate over DataFrame rows as namedtuples of the values.

Examples

>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                   'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> df
        species   population
panda   bear      1864
polar   bear      22000
koala   marsupial 80000
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print(f'content: {content}', sep='\n')
...
label: species
content:
panda         bear
polar         bear
koala    marsupial
Name: species, dtype: object
label: population
content:
panda     1864
polar    22000
koala    80000
Name: population, dtype: int64
iterrows()[source]

Iterate over DataFrame rows as (index, Series) pairs.

Yields:
  • index (label or tuple of label) – The index of the row. A tuple for a MultiIndex.

  • data (Series) – The data of the row as a Series.

Return type:

Iterable[tuple[Hashable, pandas.core.series.Series]]

See also

DataFrame.itertuples

Iterate over DataFrame rows as namedtuples of the values.

DataFrame.items

Iterate over (column name, Series) pairs.

Notes

  1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows (dtypes are preserved across columns for DataFrames). For example,

    >>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
    >>> row = next(df.iterrows())[1]
    >>> row
    int      1.0
    float    1.5
    Name: 0, dtype: float64
    >>> print(row['int'].dtype)
    float64
    >>> print(df['int'].dtype)
    int64
    

    To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the values and which is generally faster than iterrows.

  2. You should never modify something you are iterating over. This is not guaranteed to work in all cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect.
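
Examples

A minimal usage sketch:

>>> df = pd.DataFrame({'num_legs': [4, 2]}, index=['dog', 'hawk'])
>>> for index, row in df.iterrows():
...     print(index, row['num_legs'])
dog 4
hawk 2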

itertuples(index=True, name='Pandas')[source]

Iterate over DataFrame rows as namedtuples.

Parameters:
  • index (bool, default True) – If True, return the index as the first element of the tuple.

  • name (str or None, default "Pandas") – The name of the returned namedtuples or None to return regular tuples.

Returns:

An object to iterate over namedtuples for each row in the DataFrame with the first field possibly being the index and following fields being the column values.

Return type:

iterator

See also

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.

DataFrame.items

Iterate over (column name, Series) pairs.

Notes

The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start with an underscore.

Examples

>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> df
      num_legs  num_wings
dog          4          0
hawk         2          2
>>> for row in df.itertuples():
...     print(row)
...
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)

By setting the index parameter to False we can remove the index as the first element of the tuple:

>>> for row in df.itertuples(index=False):
...     print(row)
...
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)

With the name parameter set, we give a custom name to the yielded namedtuples:

>>> for row in df.itertuples(name='Animal'):
...     print(row)
...
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)
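
As noted above, column names that are invalid Python identifiers or keywords are replaced with positional names; a minimal sketch:

>>> df2 = pd.DataFrame({'a b': [1], 'class': [2]})
>>> for row in df2.itertuples():
...     print(row)
...
Pandas(Index=0, _1=1, _2=2)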
dot(other: Series) Series[source]
dot(other: DataFrame | Index | ExtensionArray | ndarray) DataFrame

Compute the matrix multiplication between the DataFrame and other.

This method computes the matrix product between the DataFrame and the values of an other Series, DataFrame or a numpy array.

It can also be called using self @ other in Python >= 3.5.

Parameters:

other (Series, DataFrame or array-like) – The other object to compute the matrix product with.

Returns:

If other is a Series, return the matrix product between self and other as a Series. If other is a DataFrame or a numpy.array, return the matrix product of self and other as a DataFrame.

Return type:

Series or DataFrame

See also

Series.dot

Similar method for Series.

Notes

The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication. In addition, the column names of DataFrame and the index of other must contain the same values, as they will be aligned prior to the multiplication.

The dot method for Series computes the inner product, instead of the matrix product here.

Examples

Here we multiply a DataFrame with a Series.

>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])
>>> s = pd.Series([1, 1, 2, 1])
>>> df.dot(s)
0    -4
1     5
dtype: int64

Here we multiply a DataFrame with another DataFrame.

>>> other = pd.DataFrame([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(other)
    0   1
0   1   4
1   2   2

Note that the dot method gives the same result as @.

>>> df @ other
    0   1
0   1   4
1   2   2

The dot method also works if other is a np.array.

>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
    0   1
0   1   4
1   2   2

Note how shuffling of the objects does not change the result.

>>> s2 = s.reindex([1, 0, 2, 3])
>>> df.dot(s2)
0    -4
1     5
dtype: int64
classmethod from_dict(data, orient='columns', dtype=None, columns=None)[source]

Construct DataFrame from dict of array-like or dicts.

Creates DataFrame object from dictionary by columns or by index allowing dtype specification.

Parameters:
  • data (dict) – Of the form {field : array-like} or {field : dict}.

  • orient ({'columns', 'index', 'tight'}, default 'columns') –

    The “orientation” of the data. If the keys of the passed dict should be the columns of the resulting DataFrame, pass ‘columns’ (default). Otherwise if the keys should be rows, pass ‘index’. If ‘tight’, assume a dict with keys [‘index’, ‘columns’, ‘data’, ‘index_names’, ‘column_names’].

    New in version 1.4.0: ‘tight’ as an allowed value for the orient argument

  • dtype (dtype, default None) – Data type to force after DataFrame construction, otherwise infer.

  • columns (list, default None) – Column labels to use when orient='index'. Raises a ValueError if used with orient='columns' or orient='tight'.

Return type:

DataFrame

See also

DataFrame.from_records

DataFrame from structured ndarray, sequence of tuples or dicts, or DataFrame.

DataFrame

DataFrame object creation using constructor.

DataFrame.to_dict

Convert the DataFrame to a dictionary.

Examples

By default the keys of the dict become the DataFrame columns:

>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Specify orient='index' to create the DataFrame using dictionary keys as rows:

>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
       0  1  2  3
row_1  3  2  1  0
row_2  a  b  c  d

When using the ‘index’ orientation, the column names can be specified manually:

>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d

Specify orient='tight' to create the DataFrame using a ‘tight’ format:

>>> data = {'index': [('a', 'b'), ('a', 'c')],
...         'columns': [('x', 1), ('y', 2)],
...         'data': [[1, 3], [2, 4]],
...         'index_names': ['n1', 'n2'],
...         'column_names': ['z1', 'z2']}
>>> pd.DataFrame.from_dict(data, orient='tight')
z1     x  y
z2     1  2
n1 n2
a  b   1  3
   c   2  4
to_numpy(dtype=None, copy=False, na_value=_NoDefault.no_default)[source]

Convert the DataFrame to a NumPy array.

By default, the dtype of the returned array will be the common NumPy dtype of all types in the DataFrame. For example, if the dtypes are float16 and float32, the result's dtype will be float32. This may require copying data and coercing values, which may be expensive.

Parameters:
  • dtype (str or numpy.dtype, optional) – The dtype to pass to numpy.asarray().

  • copy (bool, default False) – Whether to ensure that the returned value is not a view on another array. Note that copy=False does not ensure that to_numpy() is no-copy. Rather, copy=True ensures that a copy is made, even if not strictly necessary.

  • na_value (Any, optional) –

    The value to use for missing values. The default value depends on dtype and the dtypes of the DataFrame columns.

    New in version 1.1.0.

Return type:

numpy.ndarray

See also

Series.to_numpy

Similar method for Series.

Examples

>>> pd.DataFrame({"A": [1, 2], "B": [3, 4]}).to_numpy()
array([[1, 3],
       [2, 4]])

With heterogeneous data, the lowest common type will have to be used.

>>> df = pd.DataFrame({"A": [1, 2], "B": [3.0, 4.5]})
>>> df.to_numpy()
array([[1. , 3. ],
       [2. , 4.5]])

For a mix of numeric and non-numeric types, the output array will have object dtype.

>>> df['C'] = pd.date_range('2000', periods=2)
>>> df.to_numpy()
array([[1, 3.0, Timestamp('2000-01-01 00:00:00')],
       [2, 4.5, Timestamp('2000-01-02 00:00:00')]], dtype=object)
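
The na_value parameter controls how missing values appear in the result; a minimal sketch:

>>> df = pd.DataFrame({"A": [1.0, None], "B": [3.0, 4.5]})
>>> df.to_numpy(na_value=0.0)
array([[1. , 3. ],
       [0. , 4.5]])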
to_dict(orient: ~typing.Literal['dict', 'list', 'series', 'split', 'tight', 'index'] = 'dict', into: type[dict] = <class 'dict'>) dict[source]
to_dict(orient: ~typing.Literal['records'], into: type[dict] = <class 'dict'>) list[dict]

Convert the DataFrame to a dictionary.

The type of the key-value pairs can be customized with the parameters (see below).

Parameters:
  • orient (str {'dict', 'list', 'series', 'split', 'tight', 'records', 'index'}) –

    Determines the type of the values of the dictionary.

    • ‘dict’ (default) : dict like {column -> {index -> value}}

    • ‘list’ : dict like {column -> [values]}

    • ‘series’ : dict like {column -> Series(values)}

    • ‘split’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values]}

    • ‘tight’ : dict like {‘index’ -> [index], ‘columns’ -> [columns], ‘data’ -> [values], ‘index_names’ -> [index.names], ‘column_names’ -> [column.names]}

    • ‘records’ : list like [{column -> value}, … , {column -> value}]

    • ‘index’ : dict like {index -> {column -> value}}

    New in version 1.4.0: ‘tight’ as an allowed value for the orient argument

  • into (class, default dict) – The collections.abc.Mapping subclass used for all Mappings in the return value. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.

  • index (bool, default True) –

    Whether to include the index item (and index_names item if orient is ‘tight’) in the returned dictionary. Can only be False when orient is ‘split’ or ‘tight’.

    New in version 2.0.0.

Returns:

Return a collections.abc.Mapping object representing the DataFrame. The resulting transformation depends on the orient parameter.

Return type:

dict, list or collections.abc.Mapping

See also

DataFrame.from_dict

Create a DataFrame from a dictionary.

DataFrame.to_json

Convert a DataFrame to JSON format.

Examples

>>> df = pd.DataFrame({'col1': [1, 2],
...                    'col2': [0.5, 0.75]},
...                   index=['row1', 'row2'])
>>> df
      col1  col2
row1     1  0.50
row2     2  0.75
>>> df.to_dict()
{'col1': {'row1': 1, 'row2': 2}, 'col2': {'row1': 0.5, 'row2': 0.75}}

You can specify the return orientation.

>>> df.to_dict('series')
{'col1': row1    1
         row2    2
Name: col1, dtype: int64,
'col2': row1    0.50
        row2    0.75
Name: col2, dtype: float64}
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> df.to_dict('tight')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
 'data': [[1, 0.5], [2, 0.75]], 'index_names': [None], 'column_names': [None]}

You can also specify the mapping type.

>>> from collections import OrderedDict, defaultdict
>>> df.to_dict(into=OrderedDict)
OrderedDict([('col1', OrderedDict([('row1', 1), ('row2', 2)])),
             ('col2', OrderedDict([('row1', 0.5), ('row2', 0.75)]))])

If you want a defaultdict, you need to initialize it:

>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
 defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
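
With orient='split' or orient='tight', the index can be dropped from the result (new in version 2.0.0); a minimal sketch:

>>> df.to_dict('split', index=False)
{'columns': ['col1', 'col2'], 'data': [[1, 0.5], [2, 0.75]]}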
to_gbq(destination_table, project_id=None, chunksize=None, reauth=False, if_exists='fail', auth_local_webserver=True, table_schema=None, location=None, progress_bar=True, credentials=None)[source]

Write a DataFrame to a Google BigQuery table.

This function requires the pandas-gbq package.

See the How to authenticate with Google BigQuery guide for authentication instructions.

Parameters:
  • destination_table (str) – Name of table to be written, in the form dataset.tablename.

  • project_id (str, optional) – Google BigQuery Account project ID. Optional when available from the environment.

  • chunksize (int, optional) – Number of rows to be inserted in each chunk from the dataframe. Set to None to load the whole dataframe at once.

  • reauth (bool, default False) – Force Google BigQuery to re-authenticate the user. This is useful if multiple accounts are used.

  • if_exists (str, default 'fail') –

    Behavior when the destination table exists. Value can be one of:

    'fail'

    If table exists raise pandas_gbq.gbq.TableCreationError.

    'replace'

    If table exists, drop it, recreate it, and insert data.

    'append'

    If table exists, insert data. Create the table if it does not exist.

  • auth_local_webserver (bool, default True) –

    Use the local webserver flow instead of the console flow when getting user credentials.

    New in version 0.2.0 of pandas-gbq.

    Changed in version 1.5.0: Default value is changed to True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow.

  • table_schema (list of dicts, optional) –

    List of BigQuery table fields to which the corresponding DataFrame columns conform, e.g. [{'name': 'col1', 'type': 'STRING'},...]. If schema is not provided, it will be generated according to dtypes of DataFrame columns. See BigQuery API documentation on available names of a field.

    New in version 0.3.1 of pandas-gbq.

  • location (str, optional) –

    Location where the load job should run. See the BigQuery locations documentation for a list of available locations. The location must match that of the target dataset.

    New in version 0.5.0 of pandas-gbq.

  • progress_bar (bool, default True) –

    Use the library tqdm to show the progress bar for the upload, chunk by chunk.

    New in version 0.5.0 of pandas-gbq.

  • credentials (google.auth.credentials.Credentials, optional) –

    Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine google.auth.compute_engine.Credentials or Service Account google.oauth2.service_account.Credentials directly.

    New in version 0.8.0 of pandas-gbq.

Return type:

None

See also

pandas_gbq.to_gbq

This function in the pandas-gbq library.

read_gbq

Read a DataFrame from Google BigQuery.
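
Examples

A minimal sketch; the project and table names below are placeholders, and valid Google Cloud credentials are required:

>>> df = pd.DataFrame({'my_string': ['a', 'b', 'c'],
...                    'my_int64': [1, 2, 3]})
>>> df.to_gbq('my_dataset.my_table', project_id='my-project',
...           if_exists='append')  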

classmethod from_records(data, index=None, exclude=None, columns=None, coerce_float=False, nrows=None)[source]

Convert structured or record ndarray to DataFrame.

Creates a DataFrame object from a structured ndarray, sequence of tuples or dicts, or DataFrame.

Parameters:
  • data (structured ndarray, sequence of tuples or dicts, or DataFrame) – Structured input data.

  • index (str, list of fields, array-like) – Field of array to use as the index, alternately a specific set of input labels to use.

  • exclude (sequence, default None) – Columns or fields to exclude.

  • columns (sequence, default None) – Column names to use. If the passed data do not have names associated with them, this argument provides names for the columns. Otherwise this argument indicates the order of the columns in the result (any names not found in the data will become all-NA columns).

  • coerce_float (bool, default False) – Attempt to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

  • nrows (int, default None) – Number of rows to read if data is an iterator.

Return type:

DataFrame

See also

DataFrame.from_dict

DataFrame from dict of array-like or dicts.

DataFrame

DataFrame object creation using constructor.

Examples

Data can be provided as a structured ndarray:

>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
...                 dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Data can be provided as a list of dicts:

>>> data = [{'col_1': 3, 'col_2': 'a'},
...         {'col_1': 2, 'col_2': 'b'},
...         {'col_1': 1, 'col_2': 'c'},
...         {'col_1': 0, 'col_2': 'd'}]
>>> pd.DataFrame.from_records(data)
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d

Data can be provided as a list of tuples with corresponding columns:

>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')]
>>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2'])
   col_1 col_2
0      3     a
1      2     b
2      1     c
3      0     d
to_records(index=True, column_dtypes=None, index_dtypes=None)[source]

Convert DataFrame to a NumPy record array.

Index will be included as the first field of the record array if requested.

Parameters:
  • index (bool, default True) – Include index in resulting record array, stored in ‘index’ field or using the index label, if set.

  • column_dtypes (str, type, dict, default None) – If a string or type, the data type to store all columns. If a dictionary, a mapping of column names and indices (zero-indexed) to specific data types.

  • index_dtypes (str, type, dict, default None) –

    If a string or type, the data type to store all index levels. If a dictionary, a mapping of index level names and indices (zero-indexed) to specific data types.

    This mapping is applied only if index=True.

Returns:

NumPy ndarray with the DataFrame labels as fields and each row of the DataFrame as entries.

Return type:

numpy.recarray

See also

DataFrame.from_records

Convert structured or record ndarray to DataFrame.

numpy.recarray

An ndarray that allows field access using attributes, analogous to typed columns in a spreadsheet.

Examples

>>> df = pd.DataFrame({'A': [1, 2], 'B': [0.5, 0.75]},
...                   index=['a', 'b'])
>>> df
   A     B
a  1  0.50
b  2  0.75
>>> df.to_records()
rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)],
          dtype=[('index', 'O'), ('A', '<i8'), ('B', '<f8')])

If the DataFrame index has no label then the recarray field name is set to ‘index’. If the index has a label then this is used as the field name:

>>> df.index = df.index.rename("I")
>>> df.to_records()
rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)],
          dtype=[('I', 'O'), ('A', '<i8'), ('B', '<f8')])

The index can be excluded from the record array:

>>> df.to_records(index=False)
rec.array([(1, 0.5 ), (2, 0.75)],
          dtype=[('A', '<i8'), ('B', '<f8')])

Data types can be specified for the columns:

>>> df.to_records(column_dtypes={"A": "int32"})
rec.array([('a', 1, 0.5 ), ('b', 2, 0.75)],
          dtype=[('I', 'O'), ('A', '<i4'), ('B', '<f8')])

As well as for the index:

>>> df.to_records(index_dtypes="<S2")
rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)],
          dtype=[('I', 'S2'), ('A', '<i8'), ('B', '<f8')])
>>> index_dtypes = f"<S{df.index.str.len().max()}"
>>> df.to_records(index_dtypes=index_dtypes)
rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)],
          dtype=[('I', 'S1'), ('A', '<i8'), ('B', '<f8')])
to_stata(path, *, convert_dates=None, write_index=True, byteorder=None, time_stamp=None, data_label=None, variable_labels=None, version=114, convert_strl=None, compression='infer', storage_options=None, value_labels=None)[source]

Export DataFrame object to Stata dta format.

Writes the DataFrame to a Stata dataset file. “dta” files contain a Stata dataset.

Parameters:
  • path (str, path object, or buffer) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function.

  • convert_dates (dict) – Dictionary mapping columns containing datetime types to stata internal format to use when writing the dates. Options are ‘tc’, ‘td’, ‘tm’, ‘tw’, ‘th’, ‘tq’, ‘ty’. Column can be either an integer or a name. Datetime columns that do not have a conversion type specified will be converted to ‘tc’. Raises NotImplementedError if a datetime column has timezone information.

  • write_index (bool) – Write the index to Stata dataset.

  • byteorder (str) – Can be “>”, “<”, “little”, or “big”. Default is sys.byteorder.

  • time_stamp (datetime) – A datetime to use as file creation date. Default is the current time.

  • data_label (str, optional) – A label for the data set. Must be 80 characters or smaller.

  • variable_labels (dict) – Dictionary containing columns as keys and variable labels as values. Each label must be 80 characters or smaller.

  • version ({114, 117, 118, 119, None}, default 114) –

    Version to use in the output dta file. Set to None to let pandas decide between 118 or 119 formats depending on the number of columns in the frame. Version 114 can be read by Stata 10 and later. Version 117 can be read by Stata 13 or later. Version 118 is supported in Stata 14 and later. Version 119 is supported in Stata 15 and later. Version 114 limits string variables to 244 characters or fewer while versions 117 and later allow strings with lengths up to 2,000,000 characters. Versions 118 and 119 support Unicode characters, and version 119 supports more than 32,767 variables.

    Version 119 should usually only be used when the number of variables exceeds the capacity of dta format 118. Exporting smaller datasets in format 119 may have unintended consequences, and, as of November 2020, Stata SE cannot read version 119 files.

  • convert_strl (list, optional) – List of column names to convert to Stata StrL format. Only available if version is 117. Storing strings in the StrL format can produce smaller dta files if strings have more than 8 characters and values are repeated.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    New in version 1.1.0.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • value_labels (dict of dicts) –

    Dictionary containing columns as keys and dictionaries of column value to labels as values. Labels for a single variable must be 32,000 characters or smaller.

    New in version 1.4.0.

Raises:
  • NotImplementedError

    • If datetimes contain timezone information

    • Column dtype is not representable in Stata

  • ValueError

    • Columns listed in convert_dates are neither datetime64[ns] nor datetime.datetime

    • Column listed in convert_dates is not in DataFrame

    • Categorical label contains more than 32,000 characters

Return type:

None

See also

read_stata

Import Stata data files.

io.stata.StataWriter

Low-level writer for Stata data files.

io.stata.StataWriter117

Low-level writer for version 117 files.

Examples

>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon',
...                               'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')  
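
Variable labels and a newer dta version can also be written; a sketch reusing the frame above (the file name is illustrative):

>>> df.to_stata('animals_labeled.dta', version=117,
...             variable_labels={'animal': 'Animal species',
...                              'speed': 'Maximum speed'})  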
to_feather(path, **kwargs)[source]

Write a DataFrame to the binary Feather format.

Parameters:
  • path (str, path object, file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If a string or a path, it will be used as Root Directory path when writing a partitioned dataset.

  • **kwargs

    Additional keywords passed to pyarrow.feather.write_feather(). Starting with pyarrow 0.17, this includes the compression, compression_level, chunksize and version keywords.

    New in version 1.1.0.

Return type:

None

Notes

This function writes the dataframe as a feather file. Requires a default index. For saving the DataFrame with your custom index, use a method that supports custom indices, e.g. to_parquet.
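
Examples

A minimal sketch; writing and reading Feather requires pyarrow, and the file name is illustrative:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_feather('df.feather')  
>>> pd.read_feather('df.feather')  
   col1  col2
0     1     3
1     2     4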

to_markdown(buf=None, mode='wt', index=True, storage_options=None, **kwargs)[source]

Print DataFrame in Markdown-friendly format.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) –

    Add index (row) labels.

    New in version 1.1.0.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • **kwargs – These parameters will be passed to tabulate.

Returns:

DataFrame in Markdown-friendly format.

Return type:

str

Notes

Requires the tabulate package.

Examples

>>> df = pd.DataFrame(
...     data={"animal_1": ["elk", "pig"], "animal_2": ["dog", "quetzal"]}
... )
>>> print(df.to_markdown())
|    | animal_1   | animal_2   |
|---:|:-----------|:-----------|
|  0 | elk        | dog        |
|  1 | pig        | quetzal    |

Output markdown with a tabulate option.

>>> print(df.to_markdown(tablefmt="grid"))
+----+------------+------------+
|    | animal_1   | animal_2   |
+====+============+============+
|  0 | elk        | dog        |
+----+------------+------------+
|  1 | pig        | quetzal    |
+----+------------+------------+
to_parquet(path: None = None, engine: str = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: Dict[str, Any] | None = None, **kwargs) bytes[source]
to_parquet(path: FilePath | WriteBuffer[bytes], engine: str = 'auto', compression: str | None = 'snappy', index: bool | None = None, partition_cols: list[str] | None = None, storage_options: Dict[str, Any] | None = None, **kwargs) None

Write a DataFrame to the binary parquet format.

This function writes the dataframe as a parquet file. You can choose different parquet backends, and have the option of compression. See the user guide for more details.

Parameters:
  • path (str, path object, file-like object, or None, default None) –

    String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. If None, the result is returned as bytes. If a string or path, it will be used as Root Directory path when writing a partitioned dataset.

    Changed in version 1.2.0.

    Previously this was “fname”

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

  • compression ({'snappy', 'gzip', 'brotli', None}, default 'snappy') – Name of the compression to use. Use None for no compression.

  • index (bool, default None) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True, the dataframe’s index(es) will be saved; however, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.

  • partition_cols (list, optional, default None) – Column names by which to partition the dataset. Columns are partitioned in the order they are given. Must be None if path is not a string.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • **kwargs – Additional arguments passed to the parquet library. See pandas io for more details.

Return type:

bytes if no path argument is provided else None

See also

read_parquet

Read a parquet file.

DataFrame.to_orc

Write an orc file.

DataFrame.to_csv

Write a csv file.

DataFrame.to_sql

Write to a sql table.

DataFrame.to_hdf

Write to hdf.

Notes

This function requires either the fastparquet or pyarrow library.

Examples

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
>>> df.to_parquet('df.parquet.gzip',
...               compression='gzip')  
>>> pd.read_parquet('df.parquet.gzip')  
   col1  col2
0     1     3
1     2     4

If you want to get a buffer to the parquet content you can use a io.BytesIO object, as long as you don’t use partition_cols, which creates multiple files.

>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
to_orc(path=None, *, engine='pyarrow', index=None, engine_kwargs=None)[source]

Write a DataFrame to the ORC format.

New in version 1.5.0.

Parameters:
  • path (str, file-like object or None, default None) – If a string, it will be used as Root Directory path when writing a partitioned dataset. By file-like object, we refer to objects with a write() method, such as a file handle (e.g. via builtin open function). If path is None, a bytes object is returned.

  • engine (str, default 'pyarrow') – ORC library to use. Pyarrow must be >= 7.0.0.

  • index (bool, optional) – If True, include the dataframe’s index(es) in the file output. If False, they will not be written to the file. If None, similar to True, the dataframe’s index(es) will be saved; however, instead of being saved as values, the RangeIndex will be stored as a range in the metadata so it doesn’t require much space and is faster. Other indexes will be included as columns in the file output.

  • engine_kwargs (dict[str, Any] or None, default None) – Additional keyword arguments passed to pyarrow.orc.write_table().

Return type:

bytes if no path argument is provided else None

Raises:
  • NotImplementedError – Dtype of one or more columns is category, unsigned integers, interval, period or sparse.

  • ValueError – engine is not pyarrow.

See also

read_orc

Read an ORC file.

DataFrame.to_parquet

Write a parquet file.

DataFrame.to_csv

Write a csv file.

DataFrame.to_sql

Write to a sql table.

DataFrame.to_hdf

Write to hdf.

Notes

  • Before using this function you should read the user guide about ORC and install optional dependencies.

  • This function requires the pyarrow library.

  • For supported dtypes please refer to supported ORC features in Arrow.

  • Currently timezones in datetime columns are not preserved when a dataframe is converted into ORC files.

Examples

>>> df = pd.DataFrame(data={'col1': [1, 2], 'col2': [4, 3]})
>>> df.to_orc('df.orc')  
>>> pd.read_orc('df.orc')  
   col1  col2
0     1     4
1     2     3

If you want to get a buffer to the orc content you can write it to io.BytesIO:

>>> import io
>>> b = io.BytesIO(df.to_orc())  
>>> b.seek(0)  
0
>>> content = b.read()  

to_html(buf: FilePath | WriteBuffer[str], columns: Sequence[Hashable] | None = None, col_space: str | int | Sequence[str | int] | Mapping[Hashable, str | int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) None[source]
to_html(buf: None = None, columns: Sequence[Hashable] | None = None, col_space: str | int | Sequence[str | int] | Mapping[Hashable, str | int] | None = None, header: bool | Sequence[str] = True, index: bool = True, na_rep: str = 'NaN', formatters: List[Callable] | Tuple[Callable, ...] | Mapping[str | int, Callable] | None = None, float_format: FloatFormatType | None = None, sparsify: bool | None = None, index_names: bool = True, justify: str | None = None, max_rows: int | None = None, max_cols: int | None = None, show_dimensions: bool | str = False, decimal: str = '.', bold_rows: bool = True, classes: str | list | tuple | None = None, escape: bool = True, notebook: bool = False, border: int | bool | None = None, table_id: str | None = None, render_links: bool = False, encoding: str | None = None) str

Render a DataFrame as an HTML table.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • columns (sequence, optional, default None) – The subset of columns to write. Writes all columns by default.

  • col_space (str or int, list or dict of int or str, optional) – The minimum width of each column in CSS length units. An int is assumed to be px units.

  • header (bool, optional) – Whether to print column labels, default True.

  • index (bool, optional, default True) – Whether to print index (row) labels.

  • na_rep (str, optional, default 'NaN') – String representation of NaN to use.

  • formatters (list, tuple or dict of one-param. functions, optional) – Formatter functions to apply to columns’ elements by position or name. The result of each function must be a unicode string. List/tuple must be of length equal to the number of columns.

  • float_format (one-parameter function, optional, default None) –

    Formatter function to apply to columns’ elements if they are floats. This function must return a unicode string and will be applied only to the non-NaN elements, with NaN being handled by na_rep.

    Changed in version 1.2.0.

  • sparsify (bool, optional, default True) – Set to False for a DataFrame with a hierarchical index to print every multiindex key at each row.

  • index_names (bool, optional, default True) – Prints the names of the indexes.

  • justify (str, default None) –

    How to justify the column labels. If None uses the option from the print configuration (controlled by set_option), ‘right’ out of the box. Valid values are

    • left

    • right

    • center

    • justify

    • justify-all

    • start

    • end

    • inherit

    • match-parent

    • initial

    • unset.

  • max_rows (int, optional) – Maximum number of rows to display in the console.

  • max_cols (int, optional) – Maximum number of columns to display in the console.

  • show_dimensions (bool, default False) – Display DataFrame dimensions (number of rows by number of columns).

  • decimal (str, default '.') – Character recognized as decimal separator, e.g. ‘,’ in Europe.

  • bold_rows (bool, default True) – Make the row labels bold in the output.

  • classes (str or list or tuple, default None) – CSS class(es) to apply to the resulting html table.

  • escape (bool, default True) – Convert the characters <, >, and & to HTML-safe sequences.

  • notebook ({True, False}, default False) – Whether the generated HTML is for IPython Notebook.

  • border (int) – A border=border attribute is included in the opening <table> tag. Default pd.options.display.html.border.

  • table_id (str, optional) – A css id is included in the opening <table> tag if specified.

  • render_links (bool, default False) – Convert URLs to HTML links.

  • encoding (str, default "utf-8") –

    Set character encoding.

    New in version 1.0.

Returns:

If buf is None, returns the result as a string. Otherwise returns None.

Return type:

str or None

See also

to_string

Convert DataFrame to a string.
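
Examples

A minimal sketch; the CSS class name is illustrative and the full HTML output is omitted:

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> html = df.to_html(classes='my-table', index=False)
>>> html.startswith('<table')
True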

to_xml(path_or_buffer=None, index=True, root_name='data', row_name='row', na_rep=None, attr_cols=None, elem_cols=None, namespaces=None, prefix=None, encoding='utf-8', xml_declaration=True, pretty_print=True, parser='lxml', stylesheet=None, compression='infer', storage_options=None)[source]

Render a DataFrame to an XML document.

New in version 1.3.0.

Parameters:
  • path_or_buffer (str, path object, file-like object, or None, default None) – String, path object (implementing os.PathLike[str]), or file-like object implementing a write() function. If None, the result is returned as a string.

  • index (bool, default True) – Whether to include index in XML document.

  • root_name (str, default 'data') – The name of root element in XML document.

  • row_name (str, default 'row') – The name of row element in XML document.

  • na_rep (str, optional) – Missing data representation.

  • attr_cols (list-like, optional) – List of columns to write as attributes in row element. Hierarchical columns will be flattened with underscore delimiting the different levels.

  • elem_cols (list-like, optional) – List of columns to write as children in row element. By default, all columns output as children of row element. Hierarchical columns will be flattened with underscore delimiting the different levels.

  • namespaces (dict, optional) –

    All namespaces to be defined in root element. Keys of dict should be prefix names and values of dict corresponding URIs. Default namespaces should be given empty string key. For example,

    namespaces = {"": "https://example.com"}
    

  • prefix (str, optional) – Namespace prefix to be used for every element and/or attribute in document. This should be one of the keys in namespaces dict.

  • encoding (str, default 'utf-8') – Encoding of the resulting document.

  • xml_declaration (bool, default True) – Whether to include the XML declaration at start of document.

  • pretty_print (bool, default True) – Whether output should be pretty printed with indentation and line breaks.

  • parser ({'lxml','etree'}, default 'lxml') – Parser module to use for building the tree. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’, XSLT stylesheets can be applied.

  • stylesheet (str, path object or file-like object, optional) – A URL, file-like object, or a raw string containing an XSLT script used to transform the raw XML output. Script should use layout of elements and attributes from original output. This argument requires lxml to be installed. Only XSLT 1.0 scripts are currently supported, not later versions.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘path_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

Returns:

If path_or_buffer is None, returns the resulting XML format as a string. Otherwise returns None.

Return type:

None or str

See also

to_json

Convert the pandas object to a JSON string.

to_html

Convert DataFrame to a html.

Examples

>>> df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
...                    'degrees': [360, 360, 180],
...                    'sides': [4, np.nan, 3]})
>>> df.to_xml()  
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row>
    <index>0</index>
    <shape>square</shape>
    <degrees>360</degrees>
    <sides>4.0</sides>
  </row>
  <row>
    <index>1</index>
    <shape>circle</shape>
    <degrees>360</degrees>
    <sides/>
  </row>
  <row>
    <index>2</index>
    <shape>triangle</shape>
    <degrees>180</degrees>
    <sides>3.0</sides>
  </row>
</data>
>>> df.to_xml(attr_cols=[
...           'index', 'shape', 'degrees', 'sides'
...           ])  
<?xml version='1.0' encoding='utf-8'?>
<data>
  <row index="0" shape="square" degrees="360" sides="4.0"/>
  <row index="1" shape="circle" degrees="360"/>
  <row index="2" shape="triangle" degrees="180" sides="3.0"/>
</data>
>>> df.to_xml(namespaces={"doc": "https://example.com"},
...           prefix="doc")  
<?xml version='1.0' encoding='utf-8'?>
<doc:data xmlns:doc="https://example.com">
  <doc:row>
    <doc:index>0</doc:index>
    <doc:shape>square</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides>4.0</doc:sides>
  </doc:row>
  <doc:row>
    <doc:index>1</doc:index>
    <doc:shape>circle</doc:shape>
    <doc:degrees>360</doc:degrees>
    <doc:sides/>
  </doc:row>
  <doc:row>
    <doc:index>2</doc:index>
    <doc:shape>triangle</doc:shape>
    <doc:degrees>180</doc:degrees>
    <doc:sides>3.0</doc:sides>
  </doc:row>
</doc:data>
info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=None)[source]

Print a concise summary of a DataFrame.

This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.

Parameters:
  • verbose (bool, optional) – Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.

  • buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

  • max_cols (int, optional) – When to switch from the verbose to the truncated output. If the DataFrame has more than max_cols columns, the truncated output is used. By default, the setting in pandas.options.display.max_info_columns is used.

  • memory_usage (bool, str, optional) –

    Specifies whether total memory usage of the DataFrame elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

    True always shows memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.

  • show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

Returns:

This method prints a summary of a DataFrame and returns None.

Return type:

None

See also

DataFrame.describe

Generate descriptive statistics of DataFrame columns.

DataFrame.memory_usage

Memory usage of DataFrame columns.

Examples

>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                   "float_col": float_values})
>>> df
    int_col text_col  float_col
0        1    alpha       0.00
1        2     beta       0.25
2        3    gamma       0.50
3        4    delta       0.75
4        5  epsilon       1.00

Prints information for all columns:

>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   int_col    5 non-null      int64
 1   text_col   5 non-null      object
 2   float_col  5 non-null      float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Prints a summary of the column count and dtypes, but not per-column information:

>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes

Pipe the output of DataFrame.info to a buffer instead of sys.stdout, get the buffer content, and write it to a text file:

>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  
...     f.write(s)
260

The memory_usage parameter allows deep introspection mode, especially useful for big DataFrames, to help fine-tune memory optimization:

>>> random_strings_array = np.random.choice(['a', 'b', 'c'], 10 ** 6)
>>> df = pd.DataFrame({
...     'column_1': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_2': np.random.choice(['a', 'b', 'c'], 10 ** 6),
...     'column_3': np.random.choice(['a', 'b', 'c'], 10 ** 6)
... })
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 22.9+ MB
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
 #   Column    Non-Null Count    Dtype
---  ------    --------------    -----
 0   column_1  1000000 non-null  object
 1   column_2  1000000 non-null  object
 2   column_3  1000000 non-null  object
dtypes: object(3)
memory usage: 165.9 MB
memory_usage(index=True, deep=False)[source]

Return the memory usage of each column in bytes.

The memory usage can optionally include the contribution of the index and elements of object dtype.

This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.options.display.memory_usage to False.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the DataFrame’s index in returned Series. If index=True, the memory usage of the index is the first item in the output.

  • deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned values.

Returns:

A Series whose index is the original column names and whose values are the memory usage of each column in bytes.

Return type:

Series

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of an ndarray.

Series.memory_usage

Bytes consumed by a Series.

Categorical

Memory-efficient array for string values with many repeated values.

DataFrame.info

Concise summary of a DataFrame.

Notes

See the Frequently Asked Questions for more details.

Examples

>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
>>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t))
...              for t in dtypes])
>>> df = pd.DataFrame(data)
>>> df.head()
   int64  float64            complex128  object  bool
0      1      1.0              1.0+0.0j       1  True
1      1      1.0              1.0+0.0j       1  True
2      1      1.0              1.0+0.0j       1  True
3      1      1.0              1.0+0.0j       1  True
4      1      1.0              1.0+0.0j       1  True
>>> df.memory_usage()
Index           128
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64
>>> df.memory_usage(index=False)
int64         40000
float64       40000
complex128    80000
object        40000
bool           5000
dtype: int64

The memory footprint of object dtype columns is ignored by default:

>>> df.memory_usage(deep=True)
Index            128
int64          40000
float64        40000
complex128     80000
object        180000
bool            5000
dtype: int64

Use a Categorical for efficient storage of an object-dtype column with many repeated values.

>>> df['object'].astype('category').memory_usage(deep=True)
5244
transpose(*args, copy=False)[source]

Transpose index and columns.

Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T is an accessor to the method transpose().

Parameters:
  • *args (tuple, optional) – Accepted for compatibility with NumPy.

  • copy (bool, default False) –

    Whether to copy the data after transposing, even for DataFrames with a single dtype.

    Note that a copy is always required for mixed dtype DataFrames, or for DataFrames with any extension types.

Returns:

The transposed DataFrame.

Return type:

DataFrame

See also

numpy.transpose

Permute the dimensions of a given array.

Notes

Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object dtype. In such a case, a copy of the data is always made.

Examples

Square DataFrame with homogeneous dtype

>>> d1 = {'col1': [1, 2], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d1)
>>> df1
   col1  col2
0     1     3
1     2     4
>>> df1_transposed = df1.T  # or df1.transpose()
>>> df1_transposed
      0  1
col1  1  2
col2  3  4

When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with the same dtype:

>>> df1.dtypes
col1    int64
col2    int64
dtype: object
>>> df1_transposed.dtypes
0    int64
1    int64
dtype: object

Non-square DataFrame with mixed dtypes

>>> d2 = {'name': ['Alice', 'Bob'],
...       'score': [9.5, 8],
...       'employed': [False, True],
...       'kids': [0, 0]}
>>> df2 = pd.DataFrame(data=d2)
>>> df2
    name  score  employed  kids
0  Alice    9.5     False     0
1    Bob    8.0      True     0
>>> df2_transposed = df2.T  # or df2.transpose()
>>> df2_transposed
              0     1
name      Alice   Bob
score       9.5   8.0
employed  False  True
kids          0     0

When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:

>>> df2.dtypes
name         object
score       float64
employed       bool
kids          int64
dtype: object
>>> df2_transposed.dtypes
0    object
1    object
dtype: object
property T: DataFrame

The transpose of the DataFrame.

Returns:

The transposed DataFrame.

Return type:

DataFrame

See also

DataFrame.transpose

Transpose index and columns.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.T
      0  1
col1  1  2
col2  3  4
isetitem(loc, value)[source]

Set the given value in the column with position loc.

This is a positional analogue to __setitem__.

Parameters:
  • loc (int or sequence of ints) – Index position for the column.

  • value (scalar or arraylike) – Value(s) for the column.

Return type:

None

Notes

frame.isetitem(loc, value) is an in-place method as it will modify the DataFrame in place (not returning a new object). In contrast to frame.iloc[:, loc] = value, which will try to update the existing values in place, frame.isetitem(loc, value) will not update the values of the column itself in place; it will instead insert a new array.

In cases where frame.columns is unique, this is equivalent to frame[frame.columns[loc]] = value.
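
Examples

A minimal sketch of setting a column by position:

>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
>>> df.isetitem(1, [10, 11])
>>> df
   A   B
0  1  10
1  2  11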

query(expr: str, *, inplace: Literal[False] = False, **kwargs) DataFrame[source]
query(expr: str, *, inplace: Literal[True], **kwargs) None
query(expr: str, *, inplace: bool = False, **kwargs) DataFrame | None

Query the columns of a DataFrame with a boolean expression.

Parameters:
  • expr (str) –

    The query string to evaluate.

    You can refer to variables in the environment by prefixing them with an ‘@’ character like @a + b.

    You can refer to column names that are not valid Python variable names by surrounding them in backticks. Thus, column names containing spaces or punctuation (besides underscores) or starting with digits must be surrounded by backticks. (For example, a column named “Area (cm^2)” would be referenced as `Area (cm^2)`). Column names which are Python keywords (like “list”, “for”, “import”, etc) cannot be used.

    For example, if one of your columns is called a a and you want to sum it with b, your query should be `a a` + b.

  • inplace (bool) – Whether to modify the DataFrame rather than creating a new one.

  • **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by DataFrame.query().

Returns:

DataFrame resulting from the provided query expression or None if inplace=True.

Return type:

DataFrame or None

See also

eval

Evaluate a string describing operations on DataFrame columns.

DataFrame.eval

Evaluate a string describing operations on DataFrame columns.

Notes

The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails because of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.__getitem__().

This method uses the top-level eval() function to evaluate the passed query.

The query() method uses a slightly modified Python syntax by default. For example, the & and | (bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid Python, however the semantics are different.

You can change the semantics of the expression by passing the keyword argument parser='python'. This enforces the same semantics as evaluation in Python space. Likewise, you can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.

The DataFrame.index and DataFrame.columns attributes of the DataFrame instance are placed in the query namespace by default, which allows you to treat both the index and columns of the frame as a column in the frame. The identifier index is used for the frame index; you can also use the name of the index to identify it in a query. Please note that Python keywords may not be used as identifiers.

For further details and examples see the query documentation in indexing.

Backtick quoted variables

Backtick quoted variables are parsed as literal Python code and are converted internally to a Python valid identifier. This can lead to the following problems.

During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings that are allowed as a Python identifier. These characters include all operators in Python, the space character, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the query parser will raise an error. This excludes whitespace different than the space character, but also the hashtag (as it is used for comments) and the backtick itself (backtick can also not be escaped).

In a special case, quotes that make a pair around a backtick can confuse the parser. For example, `it's` > `that's` will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.

See also the Python documentation about lexical analysis (https://docs.python.org/3/reference/lexical_analysis.html) in combination with the source code in pandas.core.computation.parsing.

Examples

>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df
   A   B  C C
0  1  10   10
1  2   8    9
2  3   6    8
3  4   4    7
4  5   2    6
>>> df.query('A > B')
   A  B  C C
4  5  2    6

The previous expression is equivalent to

>>> df[df.A > df.B]
   A  B  C C
4  5  2    6

For columns with spaces in their name, you can use backtick quoting.

>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10

The previous expression is equivalent to

>>> df[df.B == df['C C']]
   A   B  C C
0  1  10   10
eval(expr: str, *, inplace: Literal[False] = False, **kwargs) Any[source]
eval(expr: str, *, inplace: Literal[True], **kwargs) None

Evaluate a string describing operations on DataFrame columns.

Operates on columns only, not specific rows or elements. This allows eval to run arbitrary code, which can make you vulnerable to code injection if you pass user input to this function.

Parameters:
  • expr (str) – The expression string to evaluate.

  • inplace (bool, default False) – If the expression contains an assignment, whether to perform the operation inplace and mutate the existing DataFrame. Otherwise, a new DataFrame is returned.

  • **kwargs – See the documentation for eval() for complete details on the keyword arguments accepted by query().

Returns:

The result of the evaluation or None if inplace=True.

Return type:

ndarray, scalar, pandas object, or None

See also

DataFrame.query

Evaluates a boolean expression to query the columns of a frame.

DataFrame.assign

Can evaluate an expression or function to create new values for a column.

eval

Evaluate a Python expression as a string using various backends.

Notes

For more details see the API documentation for eval(). For detailed examples see enhancing performance with eval.

Examples

>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2
>>> df.eval('A + B')
0    11
1    10
2     9
3     8
4     7
dtype: int64

Assignment is allowed, though by default the original DataFrame is not modified.

>>> df.eval('C = A + B')
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7
>>> df
   A   B
0  1  10
1  2   8
2  3   6
3  4   4
4  5   2

Multiple columns can be assigned to using multi-line expressions:

>>> df.eval(
...     '''
... C = A + B
... D = A - B
... '''
... )
   A   B   C  D
0  1  10  11 -9
1  2   8  10 -6
2  3   6   9 -3
3  4   4   8  0
4  5   2   7  3
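
With inplace=True the assignment mutates the original frame and eval() returns None; a sketch, reusing df from above:

>>> df.eval('C = A + B', inplace=True)
>>> df
   A   B   C
0  1  10  11
1  2   8  10
2  3   6   9
3  4   4   8
4  5   2   7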
select_dtypes(include=None, exclude=None)[source]

Return a subset of the DataFrame’s columns based on the column dtypes.

Parameters:
  • include (scalar or list-like) – A selection of dtypes or strings to be included. At least one of include and exclude must be supplied.

  • exclude (scalar or list-like) – A selection of dtypes or strings to be excluded. At least one of include and exclude must be supplied.

Returns:

The subset of the frame including the dtypes in include and excluding the dtypes in exclude.

Return type:

DataFrame

Raises:

ValueError

  • If both of include and exclude are empty

  • If include and exclude have overlapping elements

  • If any kind of string dtype is passed in

See also

DataFrame.dtypes

Return Series with the data type of each column.

Notes

  • To select all numeric types, use np.number or 'number'

  • To select strings you must use the object dtype, but note that this will return all object dtype columns

  • See the numpy dtype hierarchy

  • To select datetimes, use np.datetime64, 'datetime' or 'datetime64'

  • To select timedeltas, use np.timedelta64, 'timedelta' or 'timedelta64'

  • To select Pandas categorical dtypes, use 'category'

  • To select Pandas datetimetz dtypes, use 'datetimetz' (new in 0.20.0) or 'datetime64[ns, tz]'

Examples

>>> df = pd.DataFrame({'a': [1, 2] * 3,
...                    'b': [True, False] * 3,
...                    'c': [1.0, 2.0] * 3})
>>> df
        a      b  c
0       1   True  1.0
1       2  False  2.0
2       1   True  1.0
3       2  False  2.0
4       1   True  1.0
5       2  False  2.0
>>> df.select_dtypes(include='bool')
   b
0  True
1  False
2  True
3  False
4  True
5  False
>>> df.select_dtypes(include=['float64'])
   c
0  1.0
1  2.0
2  1.0
3  2.0
4  1.0
5  2.0
>>> df.select_dtypes(exclude=['int64'])
       b    c
0   True  1.0
1  False  2.0
2   True  1.0
3  False  2.0
4   True  1.0
5  False  2.0
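
A further sketch using the 'number' selector mentioned in the notes (expected output; bool columns are not numeric):

>>> df.select_dtypes(include='number')
   a    c
0  1  1.0
1  2  2.0
2  1  1.0
3  2  2.0
4  1  1.0
5  2  2.0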
insert(loc, column, value, allow_duplicates=_NoDefault.no_default)[source]

Insert column into DataFrame at specified location.

Raises a ValueError if column is already contained in the DataFrame, unless allow_duplicates is set to True.

Parameters:
  • loc (int) – Insertion index. Must satisfy 0 <= loc <= len(columns).

  • column (str, number, or hashable object) – Label of the inserted column.

  • value (Scalar, Series, or array-like) –

  • allow_duplicates (bool, optional, default lib.no_default) –

Return type:

None

See also

Index.insert

Insert new item by index.

Examples

>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df
   col1  col2
0     1     3
1     2     4
>>> df.insert(1, "newcol", [99, 99])
>>> df
   col1  newcol  col2
0     1      99     3
1     2      99     4
>>> df.insert(0, "col1", [100, 100], allow_duplicates=True)
>>> df
   col1  col1  newcol  col2
0   100     1      99     3
1   100     2      99     4

Notice that pandas uses index alignment when value is a Series:

>>> df.insert(0, "col0", pd.Series([5, 6], index=[1, 2]))
>>> df
   col0  col1  col1  newcol  col2
0   NaN   100     1      99     3
1   5.0   100     2      99     4
assign(**kwargs)[source]

Assign new columns to a DataFrame.

Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.

Parameters:

**kwargs (dict of {str: callable or Series}) – The column names are keywords. If the values are callable, they are computed on the DataFrame and assigned to the new columns. The callable must not change input DataFrame (though pandas doesn’t check it). If the values are not callable, (e.g. a Series, scalar, or array), they are simply assigned.

Returns:

A new DataFrame with the new columns in addition to all the existing columns.

Return type:

DataFrame

Notes

Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.

Examples

>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df
          temp_c
Portland    17.0
Berkeley    25.0

Where the value is a callable, evaluated on df:

>>> df.assign(temp_f=lambda x: x.temp_c * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:

>>> df.assign(temp_f=df['temp_c'] * 9 / 5 + 32)
          temp_c  temp_f
Portland    17.0    62.6
Berkeley    25.0    77.0

You can create multiple columns within the same assign where one of the columns depends on another one defined within the same assign:

>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15
align(other, join='outer', axis=None, level=None, copy=None, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)[source]

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DataFrame or Series) –

  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –

  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).

  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed Series:

    • pad / ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use NEXT valid observation to fill gap.

  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

  • fill_axis ({0 or 'index', 1 or 'columns'}, default 0) – Filling axis, method and limit.

  • broadcast_axis ({0 or 'index', 1 or 'columns'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.

Returns:

Aligned objects.

Return type:

tuple of (DataFrame, type of other)

Examples

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default axis=None will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
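
A sketch of fill_value, reusing df and other from above; the missing column C is filled with 0 instead of NaN (expected output):

>>> left, right = df.align(other, join="outer", axis=1, fill_value=0)
>>> left
   A  B  C  D  E
1  4  2  0  1  3
2  9  7  0  6  8
>>> right
     A    B    C    D  E
2   10   20   30   40  0
3   60   70   80   90  0
4  600  700  800  900  0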
set_axis(labels, *, axis=0, copy=None)[source]

Assign desired index to given axis.

Indexes for column or row labels can be changed by assigning a list-like or Index.

Parameters:
  • labels (list-like, Index) – The values for the new index.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to update. The value 0 identifies the rows. For Series this parameter is unused and defaults to 0.

  • copy (bool, default True) –

    Whether to make a copy of the underlying data.

    New in version 1.5.0.

Returns:

An object of type DataFrame.

Return type:

DataFrame

See also

DataFrame.rename_axis

Alter the name of the index or columns.

Examples

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

Change the row labels.

>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6

Change the column labels.

>>> df.set_axis(['I', 'II'], axis='columns')
   I  II
0  1   4
1  2   5
2  3   6

reindex(labels=None, *, index=None, columns=None, axis=None, method=None, copy=None, level=None, fill_value=nan, limit=None, tolerance=None)[source]

Conform DataFrame to new index with optional filling logic.

Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters:
  • labels (array-like, optional) – New labels / index to conform the axis specified by ‘axis’ to.

  • index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.

  • columns (array-like, optional) – New labels for the columns. Preferably an Index object to avoid duplicating data.

  • axis (int or str, optional) – Axis to target. Can be either the axis name (‘index’, ‘columns’) or number (0, 1).

  • method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –

    Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

    • None (default): don’t fill gaps

    • pad / ffill: Propagate last valid observation forward to next valid.

    • backfill / bfill: Use next valid observation to fill gap.

    • nearest: Use nearest valid observations to fill gap.

  • copy (bool, default True) – Return a new object, even if the passed indexes are the same.

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.

  • tolerance (optional) –

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

Return type:

DataFrame with changed index.

See also

DataFrame.set_index

Set row labels.

DataFrame.reset_index

Remove row labels or move them to new columns.

DataFrame.reindex_like

Change to same indices as other DataFrame.

Examples

DataFrame.reindex supports two calling conventions

  • (index=index_labels, columns=column_labels, ...)

  • (labels, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Create a dataframe with some fictional data.

>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02

We can fill in the missing values by passing a value to the keyword fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.

>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02

We can also reindex the columns.

>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

Or we can use “axis-style” keyword arguments

>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).

>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0

Suppose we decide to expand the dataframe to cover a wider date range.

>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.

>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.

See the user guide for more.
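
A small sketch of method='nearest' combined with tolerance on a numeric index (a hypothetical frame; expected output):

>>> df3 = pd.DataFrame({'x': [1, 2, 3]}, index=[0.0, 1.0, 2.0])
>>> df3.reindex([0.1, 1.9], method='nearest', tolerance=0.2)
     x
0.1  1
1.9  3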

drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: Literal[True], errors: Literal['ignore', 'raise'] = 'raise') None[source]
drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: Literal[False] = False, errors: Literal['ignore', 'raise'] = 'raise') DataFrame
drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable = None, inplace: bool = False, errors: Literal['ignore', 'raise'] = 'raise') DataFrame | None

Drop specified labels from rows or columns.

Remove rows or columns by specifying label names and corresponding axis, or by specifying directly index or column names. When using a multi-index, labels on different levels can be removed by specifying the level. See the user guide for more information about the now unused levels.

Parameters:
  • labels (single label or list-like) – Index or column labels to drop. A tuple will be used as a single label and not treated as a list-like.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Whether to drop labels from the index (0 or ‘index’) or columns (1 or ‘columns’).

  • index (single label or list-like) – Alternative to specifying axis (labels, axis=0 is equivalent to index=labels).

  • columns (single label or list-like) – Alternative to specifying axis (labels, axis=1 is equivalent to columns=labels).

  • level (int or level name, optional) – For MultiIndex, level from which the labels will be removed.

  • inplace (bool, default False) – If False, return a copy. Otherwise, do operation inplace and return None.

  • errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.

Returns:

DataFrame without the removed index or column labels or None if inplace=True.

Return type:

DataFrame or None

Raises:

KeyError – If any of the labels is not found in the selected axis.

See also

DataFrame.loc

Label-location based indexer for selection by label.

DataFrame.dropna

Return DataFrame with labels on given axis omitted where (all or any) data are missing.

DataFrame.drop_duplicates

Return DataFrame with duplicate rows removed, optionally only considering certain columns.

Series.drop

Return Series with specified index labels removed.

Examples

>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

Drop columns

>>> df.drop(['B', 'C'], axis=1)
   A   D
0  0   3
1  4   7
2  8  11
>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

Drop a row by index

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

Drop columns and/or rows of MultiIndex DataFrame

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3, 0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

Drop a specific index combination from the MultiIndex DataFrame, i.e., drop the combination 'falcon' and 'weight', which deletes only the corresponding row

>>> df.drop(index=('falcon', 'weight'))
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        length  0.3     0.2
>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3
>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8
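
A sketch of errors='ignore' on a fresh frame; labels that are not found are skipped instead of raising KeyError:

>>> small = pd.DataFrame({'A': [1, 2]})
>>> small.drop(index=[5], errors='ignore')
   A
0  1
1  2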
rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: Literal[True], level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') None[source]
rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: Literal[False] = False, level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') DataFrame
rename(mapper: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, *, index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, columns: Mapping[Any, Hashable] | Callable[[Any], Hashable] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool | None = None, inplace: bool = False, level: Hashable = None, errors: Literal['ignore', 'raise'] = 'ignore') DataFrame | None

Rename columns or index labels.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

See the user guide for more.

Parameters:
  • mapper (dict-like or function) – Dict-like or function transformations to apply to that axis’ values. Use either mapper and axis to specify the axis to target with mapper, or index and columns.

  • index (dict-like or function) – Alternative to specifying axis (mapper, axis=0 is equivalent to index=mapper).

  • columns (dict-like or function) – Alternative to specifying axis (mapper, axis=1 is equivalent to columns=mapper).

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to target with mapper. Can be either the axis name (‘index’, ‘columns’) or number (0, 1). The default is ‘index’.

  • copy (bool, default True) – Also copy underlying data.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one. If True then value of copy is ignored.

  • level (int or level name, default None) – In case of a MultiIndex, only rename labels in the specified level.

  • errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise a KeyError when a dict-like mapper, index, or columns contains labels that are not present in the Index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns:

DataFrame with the renamed axis labels or None if inplace=True.

Return type:

DataFrame or None

Raises:

KeyError – If any of the labels is not found in the selected axis and “errors=’raise’”.

See also

DataFrame.rename_axis

Set the name of the axis.

Examples

DataFrame.rename supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)

  • (mapper, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Rename columns using a mapping:

>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
>>> df.rename(columns={"A": "a", "B": "c"})
   a  c
0  1  4
1  2  5
2  3  6

Rename index using a mapping:

>>> df.rename(index={0: "x", 1: "y", 2: "z"})
   A  B
x  1  4
y  2  5
z  3  6

Cast index labels to a different type:

>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')
>>> df.rename(columns={"A": "a", "B": "b", "C": "c"}, errors="raise")
Traceback (most recent call last):
KeyError: ['C'] not found in axis

Using axis-style parameters:

>>> df.rename(str.lower, axis='columns')
   a  b
0  1  4
1  2  5
2  3  6
>>> df.rename({1: 2, 2: 4}, axis='index')
   A  B
0  1  4
2  2  5
4  3  6
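
A sketch of the level parameter on a MultiIndex, renaming labels only in the second level (a hypothetical frame; spacing approximate):

>>> midx = pd.MultiIndex.from_tuples([('a', 'x'), ('a', 'y')])
>>> mdf = pd.DataFrame({'val': [1, 2]}, index=midx)
>>> mdf.rename(index={'x': 'z'}, level=1)
     val
a z    1
  y    2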
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[False] = False, limit: int | None = None, downcast: dict | None = None) DataFrame[source]
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[True], limit: int | None = None, downcast: dict | None = None) None
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: bool = False, limit: int | None = None, downcast: dict | None = None) DataFrame | None

Fill NA/NaN values using the specified method.

Parameters:
  • value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed Series:

    • pad / ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use next valid observation to fill gap.

  • axis ({0 or 'index', 1 or 'columns'}) – Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.

  • inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

  • downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns:

Object with missing values filled or None if inplace=True.

Return type:

DataFrame or None

See also

interpolate

Fill NaN values using interpolation.

reindex

Conform object to new index.

asfreq

Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along the same column names and same indices

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.

pop(item)[source]

Return item and drop from frame. Raise KeyError if not found.

Parameters:

item (label) – Label of column to be popped.

Return type:

Series

Examples

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))
>>> df
     name   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN
>>> df.pop('class')
0      bird
1      bird
2    mammal
3    mammal
Name: class, dtype: object
>>> df
     name  max_speed
0  falcon      389.0
1  parrot       24.0
2    lion       80.5
3  monkey        NaN
replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[False] = False, limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) DataFrame[source]
replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[True], limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) None

Replace values given in to_replace with value.

Values of the DataFrame are replaced with other values dynamically.

This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
  • to_replace (str, regex, list, dict, Series, int, float, or None) –

    How to find the values that will be replaced.

    • numeric, str or regex:

      • numeric: numeric values equal to to_replace will be replaced with value

      • str: string exactly matching to_replace will be replaced with value

      • regex: regexs matching to_replace will be replaced with value

    • list of str, regex, or numeric:

      • First, if to_replace and value are both lists, they must be the same length.

      • Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.

      • str, regex and numeric rules apply as above.

    • dict:

      • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.

      • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.

      • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.

    • None:

      • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

    See the examples section for examples of each of these.

  • value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • limit (int, default None) – Maximum size gap to forward or backward fill.

  • regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

  • method ({'pad', 'ffill', 'bfill'}) – The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Returns:

Object after replacement.

Return type:

DataFrame

Raises:
  • AssertionError

    • If regex is not a bool and to_replace is not None.

  • TypeError

    • If to_replace is not a scalar, array-like, dict, or None

    • If to_replace is a dict and value is not a list, dict, ndarray, or Series

    • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series

    • When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced

  • ValueError

    • If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

DataFrame.fillna

Fill NA values.

DataFrame.where

Replace values based on boolean condition.

Series.str.replace

Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.

  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.

  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.

  • When dict is used as the to_replace value, it is like key(s) in the dict are the to_replace part and value(s) in the dict are the value parameter.

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
    A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)
    A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
    A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    3
1    3
2    3
3    4
4    5
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})
        A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
        A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
        A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
        A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
        A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
        A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the to_replace value, it is like the value(s) in the dict are equal to the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. That is why, in this case, the ‘a’ values are replaced by 10 in rows 1 and 2, and by ‘b’ in row 4.

>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object

On the other hand, if None is explicitly passed for value, it will be respected:

>>> s.replace('a', None)
0      10
1    None
2    None
3       b
4    None
dtype: object

Changed in version 1.4.0: Previously the explicit None was silently ignored.

shift(periods=1, freq=None, axis=0, fill_value=_NoDefault.no_default)[source]

Shift index by desired number of periods with an optional time freq.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.

Parameters:
  • periods (int) – Number of periods to shift. Can be positive or negative.

  • freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.

  • fill_value (object, optional) –

    The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.

    Changed in version 1.1.0.

Returns:

Copy of input object, shifted.

Return type:

DataFrame

See also

Index.shift

Shift values of Index.

DatetimeIndex.shift

Shift values of DatetimeIndex.

PeriodIndex.shift

Shift values of PeriodIndex.

Examples

>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
>>> df.shift(periods=3, fill_value=0)
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27
>>> df.shift(periods=3, freq="D")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
>>> df.shift(periods=3, freq="infer")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
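
Negative periods shift the data in the opposite direction; a sketch with the same frame (expected output):

>>> df.shift(periods=-1)
            Col1  Col2  Col3
2020-01-01  20.0  23.0  27.0
2020-01-02  15.0  18.0  22.0
2020-01-03  30.0  33.0  37.0
2020-01-04  45.0  48.0  52.0
2020-01-05   NaN   NaN   NaN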
set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[False] = False, verify_integrity: bool = False) DataFrame[source]
set_index(keys, *, drop: bool = True, append: bool = False, inplace: Literal[True], verify_integrity: bool = False) None

Set the DataFrame index using existing columns.

Set the DataFrame index (row labels) using one or more existing columns or arrays (of the correct length). The index can replace the existing index or expand on it.

Parameters:
  • keys (label or array-like or list of labels/arrays) – This parameter can be either a single column key, a single array of the same length as the calling DataFrame, or a list containing an arbitrary combination of column keys and arrays. Here, “array” encompasses Series, Index, np.ndarray, and instances of Iterator.

  • drop (bool, default True) – Delete columns to be used as the new index.

  • append (bool, default False) – Whether to append columns to existing index.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • verify_integrity (bool, default False) – Check the new index for duplicates. Otherwise defer the check until necessary. Setting to False will improve the performance of this method.

Returns:

Changed row labels or None if inplace=True.

Return type:

DataFrame or None

See also

DataFrame.reset_index

Opposite of set_index.

DataFrame.reindex

Change to new indices or expand indices.

DataFrame.reindex_like

Change to same indices as other DataFrame.

Examples

>>> df = pd.DataFrame({'month': [1, 4, 7, 10],
...                    'year': [2012, 2014, 2013, 2014],
...                    'sale': [55, 40, 84, 31]})
>>> df
   month  year  sale
0      1  2012    55
1      4  2014    40
2      7  2013    84
3     10  2014    31

Set the index to become the ‘month’ column:

>>> df.set_index('month')
       year  sale
month
1      2012    55
4      2014    40
7      2013    84
10     2014    31

Create a MultiIndex using columns ‘year’ and ‘month’:

>>> df.set_index(['year', 'month'])
            sale
year  month
2012  1     55
2014  4     40
2013  7     84
2014  10    31

Create a MultiIndex using an Index and a column:

>>> df.set_index([pd.Index([1, 2, 3, 4]), 'year'])
         month  sale
   year
1  2012  1      55
2  2014  4      40
3  2013  7      84
4  2014  10     31

Create a MultiIndex using two Series:

>>> s = pd.Series([1, 2, 3, 4])
>>> df.set_index([s, s**2])
      month  year  sale
1 1       1  2012    55
2 4       4  2014    40
3 9       7  2013    84
4 16     10  2014    31
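
A sketch of append=True, which keeps the existing index and adds ‘month’ as an extra level (expected output; spacing approximate):

>>> df.set_index('month', append=True)
         year  sale
  month
0 1      2012    55
1 4      2014    40
2 7      2013    84
3 10     2014    31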
reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: ~typing.Literal[False] = False, col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) DataFrame[source]
reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: ~typing.Literal[True], col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) None
reset_index(level: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None, *, drop: bool = False, inplace: bool = False, col_level: ~typing.Hashable = 0, col_fill: ~typing.Hashable = '', allow_duplicates: bool | ~typing.Literal[<no_default>] = _NoDefault.no_default, names: ~typing.Hashable | ~typing.Sequence[~typing.Hashable] = None) DataFrame | None

Reset the index, or a level of it.

Reset the index of the DataFrame, and use the default one instead. If the DataFrame has a MultiIndex, this method can remove one or more levels.

Parameters:
  • level (int, str, tuple, or list, default None) – Only remove the given levels from the index. Removes all levels by default.

  • drop (bool, default False) – Do not try to insert index into dataframe columns. This resets the index to the default integer index.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • col_level (int or str, default 0) – If the columns have multiple levels, determines which level the labels are inserted into. By default it is inserted into the first level.

  • col_fill (object, default '') – If the columns have multiple levels, determines how the other levels are named. If None then the index name is repeated.

  • allow_duplicates (bool, optional, default lib.no_default) –

    Allow duplicate column labels to be created.

    New in version 1.5.0.

  • names (int, str or 1-dimensional list, default None) –

    Using the given string, rename the DataFrame column which contains the index data. If the DataFrame has a MultiIndex, this has to be a list or tuple with length equal to the number of levels.

    New in version 1.5.0.

Returns:

DataFrame with the new index or None if inplace=True.

Return type:

DataFrame or None

See also

DataFrame.set_index

Opposite of reset_index.

DataFrame.reindex

Change to new indices or expand indices.

DataFrame.reindex_like

Change to same indices as other DataFrame.

Examples

>>> df = pd.DataFrame([('bird', 389.0),
...                    ('bird', 24.0),
...                    ('mammal', 80.5),
...                    ('mammal', np.nan)],
...                   index=['falcon', 'parrot', 'lion', 'monkey'],
...                   columns=('class', 'max_speed'))
>>> df
         class  max_speed
falcon    bird      389.0
parrot    bird       24.0
lion    mammal       80.5
monkey  mammal        NaN

When we reset the index, the old index is added as a column, and a new sequential index is used:

>>> df.reset_index()
    index   class  max_speed
0  falcon    bird      389.0
1  parrot    bird       24.0
2    lion  mammal       80.5
3  monkey  mammal        NaN

We can use the drop parameter to avoid the old index being added as a column:

>>> df.reset_index(drop=True)
    class  max_speed
0    bird      389.0
1    bird       24.0
2  mammal       80.5
3  mammal        NaN

You can also use reset_index with MultiIndex.

>>> index = pd.MultiIndex.from_tuples([('bird', 'falcon'),
...                                    ('bird', 'parrot'),
...                                    ('mammal', 'lion'),
...                                    ('mammal', 'monkey')],
...                                   names=['class', 'name'])
>>> columns = pd.MultiIndex.from_tuples([('speed', 'max'),
...                                      ('species', 'type')])
>>> df = pd.DataFrame([(389.0, 'fly'),
...                    (24.0, 'fly'),
...                    (80.5, 'run'),
...                    (np.nan, 'jump')],
...                   index=index,
...                   columns=columns)
>>> df
               speed species
                 max    type
class  name
bird   falcon  389.0     fly
       parrot   24.0     fly
mammal lion     80.5     run
       monkey    NaN    jump

Using the names parameter, choose a name for the index column:

>>> df.reset_index(names=['classes', 'names'])
  classes   names  speed species
                     max    type
0    bird  falcon  389.0     fly
1    bird  parrot   24.0     fly
2  mammal    lion   80.5     run
3  mammal  monkey    NaN    jump

If the index has multiple levels, we can reset a subset of them:

>>> df.reset_index(level='class')
         class  speed species
                  max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:

>>> df.reset_index(level='class', col_level=1)
                speed species
         class    max    type
name
falcon    bird  389.0     fly
parrot    bird   24.0     fly
lion    mammal   80.5     run
monkey  mammal    NaN    jump

When the index is inserted under another level, we can specify under which one with the parameter col_fill:

>>> df.reset_index(level='class', col_level=1, col_fill='species')
              species  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump

If we specify a nonexistent level for col_fill, it is created:

>>> df.reset_index(level='class', col_level=1, col_fill='genus')
                genus  speed species
                class    max    type
name
falcon           bird  389.0     fly
parrot           bird   24.0     fly
lion           mammal   80.5     run
monkey         mammal    NaN    jump
isna()[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Return type:

DataFrame

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isnull()[source]

DataFrame.isnull is an alias for DataFrame.isna.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is an NA value.

Return type:

DataFrame

See also

DataFrame.isnull

Alias of isna.

DataFrame.notna

Boolean inverse of isna.

DataFrame.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
notna()[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

DataFrame

See also

DataFrame.notnull

Alias of notna.

DataFrame.isna

Boolean inverse of notna.

DataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notnull()[source]

DataFrame.notnull is an alias for DataFrame.notna.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.nan, get mapped to False values.

Returns:

Mask of bool values for each element in DataFrame that indicates whether an element is not an NA value.

Return type:

DataFrame

See also

DataFrame.notnull

Alias of notna.

DataFrame.isna

Boolean inverse of notna.

DataFrame.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.nan])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
dropna(*, axis=0, how=<no_default>, thresh=<no_default>, subset=None, inplace=False, ignore_index=False)[source]

Remove missing values.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Determine if rows or columns which contain missing values are removed.

    • 0, or ‘index’ : Drop rows which contain missing values.

    • 1, or ‘columns’ : Drop columns which contain missing values.

    Only a single axis is allowed.

  • how ({'any', 'all'}, default 'any') –

    Determine if row or column is removed from DataFrame, when we have at least one NA or all NA.

    • ’any’ : If any NA values are present, drop that row or column.

    • ’all’ : If all values are NA, drop that row or column.

  • thresh (int, optional) – Require that many non-NA values. Cannot be combined with how.

  • subset (column label or sequence of labels, optional) – Labels along other axis to consider, e.g. if you are dropping rows these would be a list of columns to include.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 2.0.0.

Returns:

DataFrame with NA entries dropped from it or None if inplace=True.

Return type:

DataFrame or None

See also

DataFrame.isna

Indicate missing values.

DataFrame.notna

Indicate existing (non-missing) values.

DataFrame.fillna

Replace missing values.

Series.dropna

Drop missing values.

Index.dropna

Drop missing indices.

Examples

>>> df = pd.DataFrame({"name": ['Alfred', 'Batman', 'Catwoman'],
...                    "toy": [np.nan, 'Batmobile', 'Bullwhip'],
...                    "born": [pd.NaT, pd.Timestamp("1940-04-25"),
...                             pd.NaT]})
>>> df
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Drop the rows where at least one element is missing.

>>> df.dropna()
     name        toy       born
1  Batman  Batmobile 1940-04-25

Drop the columns where at least one element is missing.

>>> df.dropna(axis='columns')
       name
0    Alfred
1    Batman
2  Catwoman

Drop the rows where all elements are missing.

>>> df.dropna(how='all')
       name        toy       born
0    Alfred        NaN        NaT
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Keep only the rows with at least 2 non-NA values.

>>> df.dropna(thresh=2)
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT

Define in which columns to look for missing values.

>>> df.dropna(subset=['name', 'toy'])
       name        toy       born
1    Batman  Batmobile 1940-04-25
2  Catwoman   Bullwhip        NaT
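
Note that thresh cannot be combined with how; passing both raises an error (a sketch; the exact message may differ across versions):

>>> df.dropna(how='all', thresh=2)
Traceback (most recent call last):
   ...
TypeError: You cannot set both the how and thresh arguments at the same time.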
drop_duplicates(subset=None, *, keep='first', inplace=False, ignore_index=False)[source]

Return DataFrame with duplicate rows removed.

Considering certain columns is optional. Indexes, including time indexes, are ignored.

Parameters:
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({‘first’, ‘last’, False}, default ‘first’) –

    Determines which duplicates (if any) to keep.

    • ’first’ : Drop duplicates except for the first occurrence.

    • ’last’ : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

Returns:

DataFrame with duplicates removed or None if inplace=True.

Return type:

DataFrame or None

See also

DataFrame.value_counts

Count unique combinations of columns.

Examples

Consider dataset containing ramen rating.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, it removes duplicate rows based on all columns.

>>> df.drop_duplicates()
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

To remove duplicates on specific column(s), use subset.

>>> df.drop_duplicates(subset=['brand'])
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5

To remove duplicates and keep last occurrences, use keep.

>>> df.drop_duplicates(subset=['brand', 'style'], keep='last')
    brand style  rating
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
4  Indomie  pack     5.0
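
To relabel the result 0, 1, …, n - 1, use ignore_index:

>>> df.drop_duplicates(ignore_index=True)
    brand style  rating
0  Yum Yum   cup     4.0
1  Indomie   cup     3.5
2  Indomie  pack    15.0
3  Indomie  pack     5.0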
duplicated(subset=None, keep='first')[source]

Return boolean Series denoting duplicate rows.

Considering certain columns is optional.

Parameters:
  • subset (column label or sequence of labels, optional) – Only consider certain columns for identifying duplicates, by default use all of the columns.

  • keep ({'first', 'last', False}, default 'first') –

    Determines which duplicates (if any) to mark.

    • first : Mark duplicates as True except for the first occurrence.

    • last : Mark duplicates as True except for the last occurrence.

    • False : Mark all duplicates as True.

Returns:

Boolean Series indicating whether each row is duplicated.

Return type:

Series

See also

Index.duplicated

Equivalent method on index.

Series.duplicated

Equivalent method on Series.

Series.drop_duplicates

Remove duplicate values from Series.

DataFrame.drop_duplicates

Remove duplicate values from DataFrame.

Examples

Consider dataset containing ramen rating.

>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
    brand style  rating
0  Yum Yum   cup     4.0
1  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0

By default, for each set of duplicated values, the first occurrence is set to False and all others to True.

>>> df.duplicated()
0    False
1     True
2    False
3    False
4    False
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True.

>>> df.duplicated(keep='last')
0     True
1    False
2    False
3    False
4    False
dtype: bool

By setting keep to False, all duplicates are marked True.

>>> df.duplicated(keep=False)
0     True
1     True
2    False
3    False
4    False
dtype: bool

To find duplicates on specific column(s), use subset.

>>> df.duplicated(subset=['brand'])
0    False
1     True
2    False
3     True
4     True
dtype: bool
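
The resulting mask can be used to filter rows; for example, keeping only the non-duplicated rows is equivalent to drop_duplicates:

>>> df[~df.duplicated()]
    brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0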
sort_values(by, *, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)[source]

Sort by the values along either axis.

Parameters:
  • by (str or list of str) –

    Name or list of names to sort by.

    • if axis is 0 or ‘index’ then by may contain index levels and/or column labels.

    • if axis is 1 or ‘columns’ then by may contain column levels and/or index labels.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Axis to be sorted.

  • ascending (bool or list of bool, default True) – Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, it must match the length of by.

  • inplace (bool, default False) – If True, perform operation in-place.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.

  • na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

  • key (callable, optional) –

    Apply the key function to the values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return a Series with the same shape as the input. It will be applied to each column in by independently.

    New in version 1.1.0.

Returns:

DataFrame with sorted values or None if inplace=True.

Return type:

DataFrame or None

See also

DataFrame.sort_index

Sort a DataFrame by the index.

Series.sort_values

Similar method for a Series.

Examples

>>> df = pd.DataFrame({
...     'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2': [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
...     'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

Sort by col1

>>> df.sort_values(by=['col1'])
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D

Sort by multiple columns

>>> df.sort_values(by=['col1', 'col2'])
  col1  col2  col3 col4
1    A     1     1    B
0    A     2     0    a
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D
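
ascending can also be a list matching the length of by; for example, to sort col1 ascending while breaking ties by col2 descending:

>>> df.sort_values(by=['col1', 'col2'], ascending=[True, False])
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
5    C     4     3    F
4    D     7     2    e
3  NaN     8     4    D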

Sort Descending

>>> df.sort_values(by='col1', ascending=False)
  col1  col2  col3 col4
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B
3  NaN     8     4    D

Putting NAs first

>>> df.sort_values(by='col1', ascending=False, na_position='first')
  col1  col2  col3 col4
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F
2    B     9     9    c
0    A     2     0    a
1    A     1     1    B

Sorting with a key function

>>> df.sort_values(by='col4', key=lambda col: col.str.lower())
  col1  col2  col3 col4
0    A     2     0    a
1    A     1     1    B
2    B     9     9    c
3  NaN     8     4    D
4    D     7     2    e
5    C     4     3    F

Natural sort with the key argument, using the natsort package (https://github.com/SethMMorton/natsort).

>>> df = pd.DataFrame({
...    "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
...    "value": [10, 20, 30, 40, 50]
... })
>>> df
    time  value
0    0hr     10
1  128hr     20
2   72hr     30
3   48hr     40
4   96hr     50
>>> from natsort import index_natsorted
>>> df.sort_values(
...     by="time",
...     key=lambda x: np.argsort(index_natsorted(df["time"]))
... )
    time  value
0    0hr     10
3   48hr     40
2   72hr     30
4   96hr     50
1  128hr     20
sort_index(*, axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)[source]

Sort object by labels (along an axis).

Returns a new DataFrame sorted by label if inplace argument is False, otherwise updates the original DataFrame and returns None.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.

  • level (int or level name or list of ints or list of level names) – If not None, sort on values in specified index level(s).

  • ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

  • inplace (bool, default False) – Whether to modify the DataFrame rather than creating a new one.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. mergesort and stable are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.

  • na_position ({'first', 'last'}, default 'last') – Puts NaNs at the beginning if first; last puts NaNs at the end. Not implemented for MultiIndex.

  • sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape. For MultiIndex inputs, the key is applied per level.

    New in version 1.1.0.

Returns:

The original DataFrame sorted by the labels or None if inplace=True.

Return type:

DataFrame or None

See also

Series.sort_index

Sort Series by the index.

DataFrame.sort_values

Sort DataFrame by the value.

Series.sort_values

Sort Series by the value.

Examples

>>> df = pd.DataFrame([1, 2, 3, 4, 5], index=[100, 29, 234, 1, 150],
...                   columns=['A'])
>>> df.sort_index()
     A
1    4
29   2
100  1
150  5
234  3

By default, it sorts in ascending order; to sort in descending order, use ascending=False.

>>> df.sort_index(ascending=False)
     A
234  3
150  5
100  1
29   2
1    4

A key function can be specified which is applied to the index before sorting. For a MultiIndex this is applied to each level separately.

>>> df = pd.DataFrame({"a": [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4
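
Column labels can be sorted the same way by passing axis=1; a small illustrative example:

>>> df = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
>>> df.sort_index(axis=1)
   a  b
0  3  1
1  4  2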
value_counts(subset=None, normalize=False, sort=True, ascending=False, dropna=True)[source]

Return a Series containing counts of unique rows in the DataFrame.

New in version 1.1.0.

Parameters:
  • subset (label or list of labels, optional) – Columns to use when counting unique combinations.

  • normalize (bool, default False) – Return proportions rather than frequencies.

  • sort (bool, default True) – Sort by frequencies.

  • ascending (bool, default False) – Sort in ascending order.

  • dropna (bool, default True) –

    Don’t include counts of rows that contain NA values.

    New in version 1.3.0.

Return type:

Series

See also

Series.value_counts

Equivalent method on Series.

Notes

The returned Series will have a MultiIndex with one level per input column but an Index (non-multi) for a single label. By default, rows that contain any NA values are omitted from the result. By default, the resulting Series will be in descending order so that the first element is the most frequently-occurring row.

Examples

>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
...                    'num_wings': [2, 0, 0, 0]},
...                   index=['falcon', 'dog', 'cat', 'ant'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0
cat            4          0
ant            6          0
>>> df.value_counts()
num_legs  num_wings
4         0            2
2         2            1
6         0            1
Name: count, dtype: int64
>>> df.value_counts(sort=False)
num_legs  num_wings
2         2            1
4         0            2
6         0            1
Name: count, dtype: int64
>>> df.value_counts(ascending=True)
num_legs  num_wings
2         2            1
6         0            1
4         0            2
Name: count, dtype: int64
>>> df.value_counts(normalize=True)
num_legs  num_wings
4         0            0.50
2         2            0.25
6         0            0.25
Name: proportion, dtype: float64

With dropna set to False we can also count rows with NA values.

>>> df = pd.DataFrame({'first_name': ['John', 'Anne', 'John', 'Beth'],
...                    'middle_name': ['Smith', pd.NA, pd.NA, 'Louise']})
>>> df
  first_name middle_name
0       John       Smith
1       Anne        <NA>
2       John        <NA>
3       Beth      Louise
>>> df.value_counts()
first_name  middle_name
Beth        Louise         1
John        Smith          1
Name: count, dtype: int64
>>> df.value_counts(dropna=False)
first_name  middle_name
Anne        NaN            1
Beth        Louise         1
John        Smith          1
            NaN            1
Name: count, dtype: int64
>>> df.value_counts("first_name")
first_name
John    2
Anne    1
Beth    1
Name: count, dtype: int64
nlargest(n, columns, keep='first')[source]

Return the first n rows ordered by columns in descending order.

Return the first n rows with the largest values in columns, in descending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=False).head(n), but more performant.

Parameters:
  • n (int) – Number of rows to return.

  • columns (label or list of labels) – Column label(s) to order by.

  • keep ({'first', 'last', 'all'}, default 'first') –

    Where there are duplicate values:

    • first : prioritize the first occurrence(s)

    • last : prioritize the last occurrence(s)

    • all : do not drop any duplicates, even if it means selecting more than n items.

Returns:

The first n rows ordered by the given columns in descending order.

Return type:

DataFrame

See also

DataFrame.nsmallest

Return the first n rows ordered by columns in ascending order.

DataFrame.sort_values

Sort DataFrame by the values.

DataFrame.head

Return the first n rows without re-ordering.

Notes

This function cannot be used with all column types. For example, when specifying columns with object or category dtypes, TypeError is raised.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru          11300      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nlargest to select the three rows having the largest values in column “population”.

>>> df.nlargest(3, 'population')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT

When using keep='last', ties are resolved in reverse order:

>>> df.nlargest(3, 'population', keep='last')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN

When using keep='all', all duplicate items are maintained:

>>> df.nlargest(3, 'population', keep='all')
          population      GDP alpha-2
France      65000000  2583560      FR
Italy       59000000  1937894      IT
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN

To order by the largest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nlargest(3, ['population', 'GDP'])
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN
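
As noted above, ordering by a non-numeric column raises a TypeError (a sketch; the exact message can vary across versions):

>>> df.nlargest(3, 'alpha-2')
Traceback (most recent call last):
   ...
TypeError: Column 'alpha-2' has dtype object, cannot use method 'nlargest' with this dtype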
nsmallest(n, columns, keep='first')[source]

Return the first n rows ordered by columns in ascending order.

Return the first n rows with the smallest values in columns, in ascending order. The columns that are not specified are returned as well, but not used for ordering.

This method is equivalent to df.sort_values(columns, ascending=True).head(n), but more performant.

Parameters:
  • n (int) – Number of items to retrieve.

  • columns (list or str) – Column name or names to order by.

  • keep ({'first', 'last', 'all'}, default 'first') –

    Where there are duplicate values:

    • first : take the first occurrence.

    • last : take the last occurrence.

    • all : do not drop any duplicates, even if it means selecting more than n items.

Return type:

DataFrame

See also

DataFrame.nlargest

Return the first n rows ordered by columns in descending order.

DataFrame.sort_values

Sort DataFrame by the values.

DataFrame.head

Return the first n rows without re-ordering.

Examples

>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 337000,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df
          population      GDP alpha-2
Italy       59000000  1937894      IT
France      65000000  2583560      FR
Malta         434000    12011      MT
Maldives      434000     4520      MV
Brunei        434000    12128      BN
Iceland       337000    17036      IS
Nauru         337000      182      NR
Tuvalu         11300       38      TV
Anguilla       11300      311      AI

In the following example, we will use nsmallest to select the three rows having the smallest values in column “population”.

>>> df.nsmallest(3, 'population')
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS

When using keep='last', ties are resolved in reverse order:

>>> df.nsmallest(3, 'population', keep='last')
          population  GDP alpha-2
Anguilla       11300  311      AI
Tuvalu         11300   38      TV
Nauru         337000  182      NR

When using keep='all', all duplicate items are maintained:

>>> df.nsmallest(3, 'population', keep='all')
          population    GDP alpha-2
Tuvalu         11300     38      TV
Anguilla       11300    311      AI
Iceland       337000  17036      IS
Nauru         337000    182      NR

To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns like in the next example.

>>> df.nsmallest(3, ['population', 'GDP'])
          population  GDP alpha-2
Tuvalu         11300   38      TV
Anguilla       11300  311      AI
Nauru         337000  182      NR
swaplevel(i=-2, j=-1, axis=0)[source]

Swap levels i and j in a MultiIndex.

Default is to swap the two innermost levels of the index.

Parameters:
  • i (int or str) – Levels of the indices to be swapped. Can pass level name as string.

  • j (int or str) – Levels of the indices to be swapped. Can pass level name as string.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to swap levels on. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

Returns:

DataFrame with levels swapped in MultiIndex.

Return type:

DataFrame

Examples

>>> df = pd.DataFrame(
...     {"Grade": ["A", "B", "A", "C"]},
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> df
                                    Grade
Final exam  History     January         A
            Geography   February        B
Coursework  History     March           A
            Geography   April           C

In the following example, we will swap the levels of the row index. Levels can be swapped on the column axis in a similar manner by passing axis=1; note that the row axis (axis=0) is the default. By not supplying any arguments for i and j, we swap the last and second to last levels.

>>> df.swaplevel()
                                    Grade
Final exam  January     History         A
            February    Geography       B
Coursework  March       History         A
            April       Geography       C

By supplying one argument, we can choose which index to swap the last index with. We can for example swap the first index with the last one as follows.

>>> df.swaplevel(0)
                                    Grade
January     History     Final exam      A
February    Geography   Final exam      B
March       History     Coursework      A
April       Geography   Coursework      C

We can also define explicitly which indices we want to swap by supplying values for both i and j. Here, we for example swap the first and second indices.

>>> df.swaplevel(0, 1)
                                    Grade
History     Final exam  January         A
Geography   Final exam  February        B
History     Coursework  March           A
Geography   Coursework  April           C
reorder_levels(order, axis=0)[source]

Rearrange index levels using input order. May not drop or duplicate levels.

Parameters:
  • order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Where to reorder levels.

Return type:

DataFrame

Examples

>>> data = {
...     "class": ["Mammals", "Mammals", "Reptiles"],
...     "diet": ["Omnivore", "Carnivore", "Carnivore"],
...     "species": ["Humans", "Dogs", "Snakes"],
... }
>>> df = pd.DataFrame(data, columns=["class", "diet", "species"])
>>> df = df.set_index(["class", "diet"])
>>> df
                                  species
class      diet
Mammals    Omnivore                Humans
           Carnivore                 Dogs
Reptiles   Carnivore               Snakes

Let’s reorder the levels of the index:

>>> df.reorder_levels(["diet", "class"])
                                  species
diet      class
Omnivore  Mammals                  Humans
Carnivore Mammals                    Dogs
          Reptiles                 Snakes
compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))[source]

Compare to another DataFrame and show the differences.

New in version 1.1.0.

Parameters:
  • other (DataFrame) – Object to compare with.

  • align_axis ({0 or 'index', 1 or 'columns'}, default 1) –

    Determine which axis to align the comparison on.

    • 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.

    • 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.

  • keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.

  • keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.

  • result_names (tuple, default ('self', 'other')) –

    Set the dataframes names in the comparison.

    New in version 1.5.0.

Returns:

DataFrame that shows the differences stacked side by side.

The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.

Return type:

DataFrame

Raises:

ValueError – When the two DataFrames don’t have identical labels or shape.

See also

Series.compare

Compare with another Series and show differences.

DataFrame.equals

Test whether two objects contain the same elements.

Notes

Matching NaNs will not appear as a difference.

Can only compare identically-labeled (i.e. same shape, identical row and column labels) DataFrames.

Examples

>>> df = pd.DataFrame(
...     {
...         "col1": ["a", "a", "b", "b", "a"],
...         "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
...         "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
...     },
...     columns=["col1", "col2", "col3"],
... )
>>> df
  col1  col2  col3
0    a   1.0   1.0
1    a   2.0   2.0
2    b   3.0   3.0
3    b   NaN   4.0
4    a   5.0   5.0
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0
>>> df2
  col1  col2  col3
0    c   1.0   1.0
1    a   2.0   2.0
2    b   3.0   4.0
3    b   NaN   4.0
4    a   5.0   5.0

Align the differences on columns

>>> df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Assign result_names

>>> df.compare(df2, result_names=("left", "right"))
  col1       col3
  left right left right
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

Stack the differences on rows

>>> df.compare(df2, align_axis=0)
        col1  col3
0 self     a   NaN
  other    c   NaN
2 self   NaN   3.0
  other  NaN   4.0

Keep the equal values

>>> df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

Keep all original rows and columns

>>> df.compare(df2, keep_shape=True)
  col1       col2       col3
  self other self other self other
0    a     c  NaN   NaN  NaN   NaN
1  NaN   NaN  NaN   NaN  NaN   NaN
2  NaN   NaN  NaN   NaN  3.0   4.0
3  NaN   NaN  NaN   NaN  NaN   NaN
4  NaN   NaN  NaN   NaN  NaN   NaN

Keep all original rows and columns and also all original values

>>> df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
combine(other, func, fill_value=None, overwrite=True)[source]

Perform column-wise combine with another DataFrame.

Combines a DataFrame with other DataFrame using func to element-wise combine columns. The row and column indexes of the resulting DataFrame will be the union of the two.

Parameters:
  • other (DataFrame) – The DataFrame to merge column-wise.

  • func (function) – Function that takes two Series as inputs and returns a Series or a scalar. Used to merge the two dataframes column by column.

  • fill_value (scalar value, default None) – The value to fill NaNs with prior to passing any column to the merge func.

  • overwrite (bool, default True) – If True, columns in self that do not exist in other will be overwritten with NaNs.

Returns:

Combination of the provided DataFrames.

Return type:

DataFrame

See also

DataFrame.combine_first

Combine two DataFrame objects and default to non-null values in frame calling the method.

Examples

Combine using a simple function that chooses the smaller column.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2
>>> df1.combine(df2, take_smaller)
   A  B
0  0  3
1  0  3

Example using a true element-wise combine function.

>>> df1 = pd.DataFrame({'A': [5, 0], 'B': [2, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, np.minimum)
   A  B
0  1  2
1  0  3

Using fill_value fills Nones prior to passing the column to the merge function.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  4.0

However, if the same element in both dataframes is None, that None is preserved

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [None, 3]})
>>> df1.combine(df2, take_smaller, fill_value=-5)
   A    B
0  0 -5.0
1  0  3.0

Example that demonstrates the use of overwrite and behavior when the axes differ between the dataframes.

>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [-10, 1], }, index=[1, 2])
>>> df1.combine(df2, take_smaller)
     A    B     C
0  NaN  NaN   NaN
1  NaN  3.0 -10.0
2  NaN  3.0   1.0
>>> df1.combine(df2, take_smaller, overwrite=False)
     A    B     C
0  0.0  NaN   NaN
1  0.0  3.0 -10.0
2  NaN  3.0   1.0

Demonstrating the preference of the passed in dataframe.

>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 NaN
2  NaN  3.0 NaN
>>> df2.combine(df1, take_smaller, overwrite=False)
     A    B   C
0  0.0  NaN NaN
1  0.0  3.0 1.0
2  NaN  3.0 1.0
combine_first(other)[source]

Update null elements with value in the same location in other.

Combine two DataFrame objects by filling null values in one DataFrame with non-null values from the other DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two. When calling first.combine_first(second), the result keeps the values of first and takes values from second only where first.loc[index, col] is missing.

Parameters:

other (DataFrame) – Provided DataFrame to use to fill null values.

Returns:

The result of combining the provided DataFrame with the other object.

Return type:

DataFrame

See also

DataFrame.combine

Perform series-wise operation on two DataFrames using a given function.

Examples

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = pd.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2)
     A    B
0  1.0  3.0
1  0.0  4.0

Null values still persist if the location of that null value does not exist in other

>>> df1 = pd.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2)
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
update(other, join='left', overwrite=True, filter_func=None, errors='ignore')[source]

Modify in place using non-NA values from another DataFrame.

Aligns on indices. There is no return value.

Parameters:
  • other (DataFrame, or object coercible into a DataFrame) – Should have at least one matching index/column label with the original DataFrame. If a Series is passed, its name attribute must be set, and that will be used as the column name to align with the original DataFrame.

  • join ({'left'}, default 'left') – Only left join is implemented, keeping the index and columns of the original object.

  • overwrite (bool, default True) –

    How to handle non-NA values for overlapping keys:

    • True: overwrite original DataFrame’s values with values from other.

    • False: only update values that are NA in the original DataFrame.

  • filter_func (callable(1d-array) -> bool 1d-array, optional) – Can choose to replace values other than NA. Return True for values that should be updated.

  • errors ({'raise', 'ignore'}, default 'ignore') – If ‘raise’, will raise a ValueError if the DataFrame and other both contain non-NA data in the same place.

Returns:

This method directly changes calling object.

Return type:

None

Raises:
  • ValueError

    • When errors=’raise’ and there’s overlapping non-NA data.

    • When errors is not either ‘ignore’ or ‘raise’.

  • NotImplementedError

    • If join != ‘left’

See also

dict.update

Similar method for dictionaries.

DataFrame.merge

For column(s)-on-column(s) operations.

Examples

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, 5, 6],
...                        'C': [7, 8, 9]})
>>> df.update(new_df)
>>> df
   A  B
0  1  4
1  2  5
2  3  6

The DataFrame’s length does not increase as a result of the update, only values at matching index/column labels are updated.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
   A  B
0  a  d
1  b  e
2  c  f

For Series, its name attribute must be set.

>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_column = pd.Series(['d', 'e'], name='B', index=[0, 2])
>>> df.update(new_column)
>>> df
   A  B
0  a  d
1  b  y
2  c  e
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
...                    'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e']}, index=[1, 2])
>>> df.update(new_df)
>>> df
   A  B
0  a  x
1  b  d
2  c  e

If other contains NaNs the corresponding values are not updated in the original dataframe.

>>> df = pd.DataFrame({'A': [1, 2, 3],
...                    'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
   A      B
0  1    4.0
1  2  500.0
2  3    6.0
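
With errors='raise', overlapping non-NA data raises instead of being silently overwritten (a sketch; the exact message may vary by version):

>>> df = pd.DataFrame({'A': [1, 2, 3]})
>>> other = pd.DataFrame({'A': [4, 5, 6]})
>>> df.update(other, errors='raise')
Traceback (most recent call last):
   ...
ValueError: Data overlaps.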
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)[source]

Group DataFrame using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
  • by (mapping, function, label, pd.Grouper or list of such) – Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.

  • level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.

  • as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

  • sort (bool, default True) –

    Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

    Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no longer sort the values.

  • group_keys (bool, default True) –

    When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.

    Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.

    Changed in version 2.0.0: group_keys now defaults to True.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

  • dropna (bool, default True) –

    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

    New in version 1.1.0.

Returns:

Returns a groupby object that contains information about the groups.

Return type:

DataFrameGroupBy

See also

resample

Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Examples

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
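
With as_index=False, the group labels are returned as a regular column instead of the index (“SQL-style” output):

>>> df.groupby('Animal', as_index=False).mean()
   Animal  Max Speed
0  Falcon      375.0
1  Parrot       25.0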

Hierarchical Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed': [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Captive      390.0
       Wild         350.0
Parrot Captive       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level="Type").mean()
         Max Speed
Type
Captive      210.0
Wild         185.0

We can also choose to include NA in group keys or not by setting the dropna parameter; the default setting is True.

>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
    a   c
b
1.0 2   3
2.0 2   5
>>> df.groupby(by=["b"], dropna=False).sum()
    a   c
b
1.0 2   3
2.0 2   5
NaN 1   4
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0
>>> df.groupby(by="a", dropna=False).sum()
    b     c
a
a   13.0   13.0
b   12.3  123.0
NaN 12.3   33.0

When using .apply(), use group_keys to include or exclude the group keys. The group_keys argument defaults to True (include).

>>> df = pd.DataFrame({'Animal': ['Falcon', 'Falcon',
...                               'Parrot', 'Parrot'],
...                    'Max Speed': [380., 370., 24., 26.]})
>>> df.groupby("Animal", group_keys=True).apply(lambda x: x)
          Animal  Max Speed
Animal
Falcon 0  Falcon      380.0
       1  Falcon      370.0
Parrot 2  Parrot       24.0
       3  Parrot       26.0
>>> df.groupby("Animal", group_keys=False).apply(lambda x: x)
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
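
For Categorical groupers, observed controls whether unobserved categories appear in the result; an illustrative sketch:

>>> cat = pd.Categorical(["a", "a"], categories=["a", "b"])
>>> df = pd.DataFrame({"key": cat, "value": [1, 2]})
>>> df.groupby("key", observed=False).sum()
     value
key
a        3
b        0
>>> df.groupby("key", observed=True).sum()
     value
key
a        3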
pivot(*, columns, index=<no_default>, values=<no_default>)[source]

Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation; multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.

Parameters:
  • columns (str or object or a list of str) –

    Column to use to make new frame’s columns.

    Changed in version 1.1.0: Also accept list of columns names.

  • index (str or object or a list of str, optional) –

    Column to use to make new frame’s index. If not given, uses existing index.

    Changed in version 1.1.0: Also accept list of index names.

  • values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

Returns:

Returns reshaped DataFrame.

Return type:

DataFrame

Raises:

ValueError – When there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.

See also

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DataFrame.unstack

Pivot based on the index values instead of a column.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Notes

For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

You could also assign a list of column names or a list of index names.

>>> df = pd.DataFrame({
...        "lev1": [1, 1, 1, 2, 2, 2],
...        "lev2": [1, 1, 2, 1, 1, 2],
...        "lev3": [1, 2, 1, 2, 1, 2],
...        "lev4": [1, 2, 3, 4, 5, 6],
...        "values": [0, 1, 2, 3, 4, 5]})
>>> df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0

A ValueError is raised if there are any duplicates.

>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

Notice that the first two rows are the same for our index and columns arguments.

>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape
pivot_table(values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)[source]

Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:
  • values (list-like or scalar, optional) – Column or columns to aggregate.

  • index (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).

  • columns (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).

  • aggfunc (function, list of functions, dict, default numpy.mean) – If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions. If margins=True, aggfunc will be used to calculate the partial aggregates.

  • fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table, after aggregation).

  • margins (bool, default False) – If margins=True, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns.

  • dropna (bool, default True) – Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.

  • margins_name (str, default 'All') – Name of the row / column that will contain the totals when margins is True.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

  • sort (bool, default True) –

    Specifies if the result should be sorted.

    New in version 1.3.0.

Returns:

An Excel style pivot table.

Return type:

DataFrame

See also

DataFrame.pivot

Pivot without aggregation that can handle non-numeric data.

DataFrame.melt

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Notes

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

This first example aggregates values by taking the sum.

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum)
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

We can also fill missing values using the fill_value parameter.

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

The next example aggregates by taking the mean across multiple columns.

>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean, 'E': np.mean})
>>> table
                D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333

We can also calculate multiple types of aggregations for any given value column.

>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean,
...                                 'E': [min, max, np.mean]})
>>> table
                  D   E
               mean max      mean  min
A   C
bar large  5.500000   9  7.500000    6
    small  5.500000   9  8.500000    8
foo large  2.000000   5  4.500000    4
    small  2.333333   6  4.333333    2
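
With margins=True, partial totals are added in a row and column named by margins_name (a sketch; the exact dtype rendering may differ slightly by version):

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum, margins=True)
>>> table
C        large  small  All
A   B
bar one    4.0    5.0    9
    two    7.0    6.0   13
foo one    4.0    1.0    5
    two    NaN    6.0    6
All       15.0   18.0   33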
stack(level=-1, dropna=True)[source]

Stack the prescribed level(s) from columns to index.

Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of the current dataframe:

  • if the columns have a single level, the output is a Series;

  • if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s) and the output is a DataFrame.

Parameters:
  • level (int, str, list, default -1) – Level(s) to stack from the column axis onto the index axis, defined as one index or label, or a list of indices or labels.

  • dropna (bool, default True) – Whether to drop rows in the resulting Frame/Series with missing values. Stacking a column level onto the index axis can create combinations of index and column values that are missing from the original dataframe. See Examples section.

Returns:

Stacked dataframe or series.

Return type:

DataFrame or Series

See also

DataFrame.unstack

Unstack prescribed level(s) from index axis onto column axis.

DataFrame.pivot

Reshape dataframe from long format to wide format.

DataFrame.pivot_table

Create a spreadsheet-style pivot table as a DataFrame.

Notes

The function is named by analogy with a collection of books being reorganized from being side by side on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other (in the index of the dataframe).

Reference the user guide for more examples.

Examples

Single level columns

>>> df_single_level_cols = pd.DataFrame([[0, 1], [2, 3]],
...                                     index=['cat', 'dog'],
...                                     columns=['weight', 'height'])

Stacking a dataframe with a single level column axis returns a Series:

>>> df_single_level_cols
     weight height
cat       0      1
dog       2      3
>>> df_single_level_cols.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64

Multi level columns: simple case

>>> multicol1 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('weight', 'pounds')])
>>> df_multi_level_cols1 = pd.DataFrame([[1, 2], [2, 4]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol1)

Stacking a dataframe with a multi-level column axis:

>>> df_multi_level_cols1
     weight
         kg    pounds
cat       1        2
dog       2        4
>>> df_multi_level_cols1.stack()
            weight
cat kg           1
    pounds       2
dog kg           2
    pounds       4

Missing values

>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
...                                        ('height', 'm')])
>>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:

>>> df_multi_level_cols2
    weight height
        kg      m
cat    1.0    2.0
dog    3.0    4.0
>>> df_multi_level_cols2.stack()
        height  weight
cat kg     NaN     1.0
    m      2.0     NaN
dog kg     NaN     3.0
    m      4.0     NaN

Prescribing the level(s) to be stacked

The first parameter controls which level or levels are stacked:

>>> df_multi_level_cols2.stack(0)
             kg    m
cat height  NaN  2.0
    weight  1.0  NaN
dog height  NaN  4.0
    weight  3.0  NaN
>>> df_multi_level_cols2.stack([0, 1])
cat  height  m     2.0
     weight  kg    1.0
dog  height  m     4.0
     weight  kg    3.0
dtype: float64

Dropping missing values

>>> df_multi_level_cols3 = pd.DataFrame([[None, 1.0], [2.0, 3.0]],
...                                     index=['cat', 'dog'],
...                                     columns=multicol2)

Note that rows where all values are missing are dropped by default but this behaviour can be controlled via the dropna keyword parameter:

>>> df_multi_level_cols3
    weight height
        kg      m
cat    NaN    1.0
dog    2.0    3.0
>>> df_multi_level_cols3.stack(dropna=False)
        height  weight
cat kg     NaN     NaN
    m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
>>> df_multi_level_cols3.stack(dropna=True)
        height  weight
cat m      1.0     NaN
dog kg     NaN     2.0
    m      3.0     NaN
explode(column, ignore_index=False)[source]

Transform each element of a list-like to a row, replicating index values.

Parameters:
  • column (IndexLabel) –

    Column(s) to explode. For multiple columns, specify a non-empty list with each element being a str or tuple, and the list-like data in all specified columns must have matching lengths within each row of the frame.

    New in version 1.3.0: Multi-column explode

  • ignore_index (bool, default False) –

    If True, the resulting index will be labeled 0, 1, …, n - 1.

    New in version 1.1.0.

Returns:

Exploded lists to rows of the subset columns; index will be duplicated for these rows.

Return type:

DataFrame

Raises:

ValueError :

  • If columns of the frame are not unique.

  • If the specified columns to explode form an empty list.

  • If the specified columns to explode do not have matching counts of elements rowwise in the frame.

See also

DataFrame.unstack

Pivot a level of the (necessarily hierarchical) index labels.

DataFrame.melt

Unpivot a DataFrame from wide format to long format.

Series.explode

Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({'A': [[0, 1, 2], 'foo', [], [3, 4]],
...                    'B': 1,
...                    'C': [['a', 'b', 'c'], np.nan, [], ['d', 'e']]})
>>> df
           A  B          C
0  [0, 1, 2]  1  [a, b, c]
1        foo  1        NaN
2         []  1         []
3     [3, 4]  1     [d, e]

Single-column explode.

>>> df.explode('A')
     A  B          C
0    0  1  [a, b, c]
0    1  1  [a, b, c]
0    2  1  [a, b, c]
1  foo  1        NaN
2  NaN  1         []
3    3  1     [d, e]
3    4  1     [d, e]

Multi-column explode.

>>> df.explode(list('AC'))
     A  B    C
0    0  1    a
0    1  1    b
0    2  1    c
1  foo  1  NaN
2  NaN  1  NaN
3    3  1    d
3    4  1    e
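
The ignore_index flag renumbers the result instead of repeating the original index labels. A brief sketch reusing the frame above (expected output on pandas >= 1.1):

>>> df.explode('A', ignore_index=True)
     A  B          C
0    0  1  [a, b, c]
1    1  1  [a, b, c]
2    2  1  [a, b, c]
3  foo  1        NaN
4  NaN  1         []
5    3  1     [d, e]
6    4  1     [d, e]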
unstack(level=-1, fill_value=None)[source]

Pivot a level of the (necessarily hierarchical) index labels.

Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted index labels.

If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are not a MultiIndex).

Parameters:
  • level (int, str, or list of these, default -1 (last level)) – Level(s) of index to unstack, can pass level name.

  • fill_value (int, str or dict) – Replace NaN with this value if the unstack produces missing values.

Return type:

Series or DataFrame

See also

DataFrame.pivot

Pivot a table based on column values.

DataFrame.stack

Pivot a level of the column labels (inverse operation from unstack).

Notes

Reference the user guide for more examples.

Examples

>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)
>>> s
one  a   1.0
     b   2.0
two  a   3.0
     b   4.0
dtype: float64
>>> s.unstack(level=-1)
     a   b
one  1.0  2.0
two  3.0  4.0
>>> s.unstack(level=0)
   one  two
a  1.0   3.0
b  2.0   4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one  a  1.0
     b  2.0
two  a  3.0
     b  4.0
dtype: float64
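
The fill_value argument replaces the NaN values that unstacking would otherwise introduce. A minimal sketch, dropping one entry from s so that a combination is missing:

>>> s.iloc[:3].unstack(fill_value=0)
       a    b
one  1.0  2.0
two  3.0  0.0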
melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)[source]

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.

  • value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name (scalar, default 'value') – Name to use for the ‘value’ column.

  • col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.

  • ignore_index (bool, default True) –

    If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

    New in version 1.1.0.

Returns:

Unpivoted DataFrame.

Return type:

DataFrame

See also

melt

Identical method.

pivot_table

Create a spreadsheet-style pivot table as a DataFrame.

DataFrame.pivot

Return reshaped DataFrame organized by given index / column values.

DataFrame.explode

Explode a DataFrame from list-like columns to long format.

Notes

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> df.melt(id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> df.melt(id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

The names of ‘variable’ and ‘value’ columns can be customized:

>>> df.melt(id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5

Original index values can be kept around:

>>> df.melt(id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
0  a        C      2
1  b        C      4
2  c        C      6

If you have multi-index columns:

>>> df.columns = [list('ABC'), list('DEF')]
>>> df
   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6
>>> df.melt(col_level=0, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> df.melt(id_vars=[('A', 'D')], value_vars=[('B', 'E')])
  (A, D) variable_0 variable_1  value
0      a          B          E      1
1      b          B          E      3
2      c          B          E      5
diff(periods=1, axis=0)[source]

First discrete difference of element.

Calculates the difference of a DataFrame element compared with another element in the DataFrame (default is element in previous row).

Parameters:
  • periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Take difference over rows (0) or columns (1).

Returns:

First differences of the DataFrame.

Return type:

DataFrame

See also

DataFrame.pct_change

Percent change over given number of periods.

DataFrame.shift

Shift index by desired number of periods with an optional time freq.

Series.diff

First discrete difference of object.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the DataFrame; for numeric dtypes, the dtype of the result is float64.

Examples

Difference with previous row

>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})
>>> df
   a  b   c
0  1  1   1
1  2  1   4
2  3  2   9
3  4  3  16
4  5  5  25
5  6  8  36
>>> df.diff()
     a    b     c
0  NaN  NaN   NaN
1  1.0  0.0   3.0
2  1.0  1.0   5.0
3  1.0  1.0   7.0
4  1.0  2.0   9.0
5  1.0  3.0  11.0

Difference with previous column

>>> df.diff(axis=1)
    a  b   c
0 NaN  0   0
1 NaN -1   3
2 NaN -1   7
3 NaN -1  13
4 NaN  0  20
5 NaN  2  28

Difference with 3rd previous row

>>> df.diff(periods=3)
     a    b     c
0  NaN  NaN   NaN
1  NaN  NaN   NaN
2  NaN  NaN   NaN
3  3.0  2.0  15.0
4  3.0  4.0  21.0
5  3.0  6.0  27.0

Difference with following row

>>> df.diff(periods=-1)
     a    b     c
0 -1.0  0.0  -3.0
1 -1.0 -1.0  -5.0
2 -1.0 -1.0  -7.0
3 -1.0 -2.0  -9.0
4 -1.0 -3.0 -11.0
5  NaN  NaN   NaN

Overflow in input dtype

>>> df = pd.DataFrame({'a': [1, 0]}, dtype=np.uint8)
>>> df.diff()
       a
0    NaN
1  255.0
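
Boolean input

As noted above, boolean dtypes are differenced with operator.xor() rather than subtraction; here the result dtype is object rather than float64:

>>> pd.Series([True, False, True]).diff()
0      NaN
1     True
2     True
dtype: object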
aggregate(func=None, axis=0, *args, **kwargs)[source]

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

scalar, Series or DataFrame – The return can be:

  • scalar : when Series.agg is called with a single function

  • Series : when DataFrame.agg is called with a single function

  • DataFrame : when DataFrame.agg is called with several functions

Return type:

scalar, Series or DataFrame

The aggregation operations are always performed over an axis, either the index (default) or the column axis. This behavior is different from numpy aggregation functions (mean, median, prod, sum, std, var), where the default is to compute the aggregation of the flattened array, e.g., numpy.mean(arr_2d) as opposed to numpy.mean(arr_2d, axis=0).

See also

DataFrame.apply

Perform any type of operations.

DataFrame.transform

Perform transformation type operations.

core.groupby.GroupBy

Perform operations over groups.

core.resample.Resampler

Perform operations over resampled bins.

core.window.Rolling

Perform operations over rolling window.

core.window.Expanding

Perform operations over expanding window.

core.window.ExponentialMovingWindow

Perform operation over exponential weighted window.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> df = pd.DataFrame([[1, 2, 3],
...                    [4, 5, 6],
...                    [7, 8, 9],
...                    [np.nan, np.nan, np.nan]],
...                   columns=['A', 'B', 'C'])

Aggregate these functions over the rows.

>>> df.agg(['sum', 'min'])
        A     B     C
sum  12.0  15.0  18.0
min   1.0   2.0   3.0

Different aggregations per column.

>>> df.agg({'A' : ['sum', 'min'], 'B' : ['min', 'max']})
        A    B
sum  12.0  NaN
min   1.0  2.0
max   NaN  8.0

Aggregate different functions over the columns and rename the index of the resulting DataFrame.

>>> df.agg(x=('A', max), y=('B', 'min'), z=('C', np.mean))
     A    B    C
x  7.0  NaN  NaN
y  NaN  2.0  NaN
z  NaN  NaN  6.0

Aggregate over the columns.

>>> df.agg("mean", axis="columns")
0    2.0
1    5.0
2    8.0
3    NaN
dtype: float64
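
Keyword arguments are forwarded to func; a small sketch using a hypothetical bias keyword on a lambda (the keyword name is illustrative, not part of the pandas API):

>>> df.agg(lambda s, bias=0: s.sum() + bias, bias=10)
A    22.0
B    25.0
C    28.0
dtype: float64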
agg(func=None, axis=0, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

agg is an alias for aggregate; its parameters, return values, and examples are identical to DataFrame.aggregate documented above.
any(axis=0, bool_only=None, skipna=True, level=None, **kwargs)

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, then a DataFrame is returned; otherwise, a Series is returned.

Return type:

Series or DataFrame

See also

numpy.any

Numpy version of this method.

Series.any

Return whether any element is True.

Series.all

Return whether all elements are True.

DataFrame.any

Return whether any element is True over requested axis.

DataFrame.all

Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')
0    True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
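
bool_only restricts the reduction to boolean columns; a brief sketch in which the integer column B is excluded:

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df.any(bool_only=True)
A    True
dtype: bool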
transform(func, axis=0, *args, **kwargs)[source]

Call func on self producing a DataFrame with the same axis shape as self.

Parameters:
  • func (function, str, list-like or dict-like) –

    Function to use for transforming the data. If a function, must either work when passed a DataFrame or when passed to DataFrame.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.

    Accepted combinations are:

    • function

    • string function name

    • list-like of functions and/or function names, e.g. [np.exp, 'sqrt']

    • dict-like of axis labels -> functions, function names or list-like of such.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’: apply function to each column. If 1 or ‘columns’: apply function to each row.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

A DataFrame that must have the same length as self.

Return type:

DataFrame

Raises:

ValueError – If the returned DataFrame has a different length than self.

See also

DataFrame.agg

Only perform aggregating type operations.

DataFrame.apply

Invoke function on a DataFrame.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.

Examples

>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4

Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to provide several input functions:

>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056

You can call transform on a GroupBy object:

>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4
apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)[source]

Apply a function along an axis of the DataFrame.

Objects passed to the function are Series objects whose index is either the DataFrame’s index (axis=0) or the DataFrame’s columns (axis=1). By default (result_type=None), the final return type is inferred from the return type of the applied function. Otherwise, it depends on the result_type argument.

Parameters:
  • func (function) – Function to apply to each column or row.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    Axis along which the function is applied:

    • 0 or ‘index’: apply function to each column.

    • 1 or ‘columns’: apply function to each row.

  • raw (bool, default False) –

    Determines if row or column is passed as a Series or ndarray object:

    • False : passes each row or column as a Series to the function.

    • True : the passed function will receive ndarray objects instead. If you are just applying a NumPy reduction function this will achieve much better performance.

  • result_type ({'expand', 'reduce', 'broadcast', None}, default None) –

    These only act when axis=1 (columns):

    • ’expand’ : list-like results will be turned into columns.

    • ’reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

    • ’broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.

    The default behaviour (None) depends on the return value of the applied function: list-like results will be returned as a Series of those. However if the apply function returns a Series these are expanded to columns.

  • args (tuple) – Positional arguments to pass to func in addition to the array/series.

  • **kwargs – Additional keyword arguments to pass to func.

Returns:

Result of applying func along the given axis of the DataFrame.

Return type:

Series or DataFrame

See also

DataFrame.applymap

For elementwise operations.

DataFrame.aggregate

Only perform aggregating type operations.

DataFrame.transform

Only perform transforming type operations.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See gotchas.udf-mutation for more details.

Examples

>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9

Using a numpy universal function (in this case the same as np.sqrt(df)):

>>> df.apply(np.sqrt)
     A    B
0  2.0  3.0
1  2.0  3.0
2  2.0  3.0

Using a reducing function on either axis

>>> df.apply(np.sum, axis=0)
A    12
B    27
dtype: int64
>>> df.apply(np.sum, axis=1)
0    13
1    13
2    13
dtype: int64

Returning a list-like will result in a Series

>>> df.apply(lambda x: [1, 2], axis=1)
0    [1, 2]
1    [1, 2]
2    [1, 2]
dtype: object

Passing result_type='expand' will expand list-like results to columns of a Dataframe

>>> df.apply(lambda x: [1, 2], axis=1, result_type='expand')
   0  1
0  1  2
1  1  2
2  1  2

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

>>> df.apply(lambda x: pd.Series([1, 2], index=['foo', 'bar']), axis=1)
   foo  bar
0    1    2
1    1    2
2    1    2

Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar is returned by the function, and broadcast it along the axis. The resulting column names will be the originals.

>>> df.apply(lambda x: [1, 2], axis=1, result_type='broadcast')
   A  B
0  1  2
1  1  2
2  1  2
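
Positional arguments can be forwarded to func through args; a minimal sketch adding a constant to every row:

>>> df.apply(lambda x, y: x + y, args=(1,), axis=1)
   A   B
0  5  10
1  5  10
2  5  10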
applymap(func, na_action=None, **kwargs)[source]

Apply a function to a Dataframe elementwise.

This method applies a function that accepts and returns a scalar to every element of a DataFrame.

Parameters:
  • func (callable) – Python function, returns a single value from a single value.

  • na_action ({None, 'ignore'}, default None) –

    If ‘ignore’, propagate NaN values, without passing them to func.

    New in version 1.2.

  • **kwargs –

    Additional keyword arguments to pass to func.

    New in version 1.3.0.

Returns:

Transformed DataFrame.

Return type:

DataFrame

See also

DataFrame.apply

Apply a function along input axis of DataFrame.

Examples

>>> df = pd.DataFrame([[1, 2.12], [3.356, 4.567]])
>>> df
       0      1
0  1.000  2.120
1  3.356  4.567
>>> df.applymap(lambda x: len(str(x)))
   0  1
0  3  4
1  5  5

Like Series.map, NA values can be ignored:

>>> df_copy = df.copy()
>>> df_copy.iloc[0, 0] = pd.NA
>>> df_copy.applymap(lambda x: len(str(x)), na_action='ignore')
     0  1
0  NaN  4
1  5.0  5

Note that a vectorized version of func often exists, which will be much faster. You could square each number elementwise.

>>> df.applymap(lambda x: x**2)
           0          1
0   1.000000   4.494400
1  11.262736  20.857489

But it’s better to avoid applymap in that case.

>>> df ** 2
           0          1
0   1.000000   4.494400
1  11.262736  20.857489
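
Keyword arguments are forwarded to func as well (pandas >= 1.3); a short sketch rounding every element via the builtin round:

>>> df.applymap(round, ndigits=1)
     0    1
0  1.0  2.1
1  3.4  4.6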
add(other, axis='columns', level=None, fill_value=None)

Get Addition of dataframe and other, element-wise (binary operator add).

Equivalent to dataframe + other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, radd.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
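
fill_value behaves the same way for add; a brief sketch reusing other from above, where the missing degrees column is filled with 0 before the addition:

>>> df.add(other, fill_value=0)
           angles  degrees
circle          0    360.0
triangle        6    180.0
rectangle       8    360.0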
all(axis=0, bool_only=None, skipna=True, **kwargs)

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element within a series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, then a DataFrame is returned; otherwise, a Series is returned.

Return type:

Series or DataFrame

See also

Series.all

Return True if all elements are True.

DataFrame.any

Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if values in each row all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)
False
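
As with any, bool_only limits the check to boolean columns; a small sketch where the integer column is ignored:

>>> df2 = pd.DataFrame({"flag": [True, False], "count": [1, 2]})
>>> df2.all(bool_only=True)
flag    False
dtype: bool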
cummax(axis=None, skipna=True, *args, **kwargs)

Return cumulative maximum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative maximum of Series or DataFrame.

Return type:

Series or DataFrame

See also

core.window.expanding.Expanding.max

Similar functionality but ignores NaN values.

DataFrame.max

Return the maximum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
cummin(axis=None, skipna=True, *args, **kwargs)

Return cumulative minimum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative minimum of Series or DataFrame.

Return type:

Series or DataFrame

See also

core.window.expanding.Expanding.min

Similar functionality but ignores NaN values.

DataFrame.min

Return the minimum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
cumprod(axis=None, skipna=True, *args, **kwargs)

Return cumulative product over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative product of Series or DataFrame.

Return type:

Series or DataFrame

See also

core.window.expanding.Expanding.prod

Similar functionality but ignores NaN values.

DataFrame.prod

Return the product over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
cumsum(axis=None, skipna=True, *args, **kwargs)

Return cumulative sum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative sum of Series or DataFrame.

Return type:

Series or DataFrame

See also

core.window.expanding.Expanding.sum

Similar functionality but ignores NaN values.

DataFrame.sum

Return the sum over DataFrame axis.

DataFrame.cummax

Return cumulative maximum over DataFrame axis.

DataFrame.cummin

Return cumulative minimum over DataFrame axis.

DataFrame.cumsum

Return cumulative sum over DataFrame axis.

DataFrame.cumprod

Return cumulative product over DataFrame axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
div(other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns. (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by constant with reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and Series by axis with operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply a dictionary by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply a DataFrame of different shape with operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a MultiIndex by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
divide(other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

divide is an alias for div; its parameters, behavior, and examples are identical to DataFrame.div documented above.
eq(other, axis='columns', level=None)

Get Equal to of dataframe and other, element-wise (binary operator eq).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
floordiv(other, axis='columns', level=None, fill_value=None)

Get Integer division of dataframe and other, element-wise (binary operator floordiv).

Equivalent to dataframe // other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rfloordiv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, and again with the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, matching keys by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
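
The shared examples above never invoke floordiv itself. As a quick sketch on the same df, floor division truncates each quotient downward, and the method form matches the // operator:

>>> df // 2
           angles  degrees
circle          0      180
triangle        1       90
rectangle       2      180
>>> df.floordiv(2)
           angles  degrees
circle          0      180
triangle        1       90
rectangle       2      180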
ge(other, axis='columns', level=None)

Get Greater than or equal to of dataframe and other, element-wise (binary operator ge).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
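
None of the shared examples above calls ge directly; as a quick sketch against a scalar on the same df:

>>> df.ge(150)
    cost  revenue
A   True    False
B   True     True
C  False     True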
gt(other, axis='columns', level=None)

Get Greater than of dataframe and other, element-wise (binary operator gt).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
join(other, on=None, how='left', lsuffix='', rsuffix='', sort=False, validate=None)[source]

Join columns of another DataFrame.

Join columns with other DataFrame either on index or on a key column. Efficiently join multiple DataFrame objects by index at once by passing a list.

Parameters:
  • other (DataFrame, Series, or a list containing any combination of them) – Index should be similar to one of the columns in this one. If a Series is passed, its name attribute must be set, and that will be used as the column name in the resulting joined DataFrame.

  • on (str, list of str, or array-like, optional) – Column or index level name(s) in the caller to join on the index in other, otherwise joins index-on-index. If multiple values given, the other DataFrame must have a MultiIndex. Can pass an array as the join key if it is not already contained in the calling DataFrame. Like an Excel VLOOKUP operation.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'left') –

    How to handle the operation of the two objects.

    • left: use calling frame’s index (or column if on is specified)

    • right: use other’s index.

    • outer: form union of calling frame’s index (or column if on is specified) with other’s index, and sort it lexicographically.

    • inner: form intersection of calling frame’s index (or column if on is specified) with other’s index, preserving the order of the calling’s one.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

      New in version 1.2.0.

  • lsuffix (str, default '') – Suffix to use from left frame’s overlapping columns.

  • rsuffix (str, default '') – Suffix to use from right frame’s overlapping columns.

  • sort (bool, default False) – Order result DataFrame lexicographically by the join key. If False, the order of the join key depends on the join type (how keyword).

  • validate (str, optional) –

    If specified, checks if join is of specified type.

    • “one_to_one” or “1:1”: check if join keys are unique in both left and right datasets.

    • “one_to_many” or “1:m”: check if join keys are unique in left dataset.

    • “many_to_one” or “m:1”: check if join keys are unique in right dataset.

    • “many_to_many” or “m:m”: allowed, but does not result in checks.

    New in version 1.5.0.

Returns:

A dataframe containing columns from both the caller and other.

Return type:

DataFrame

See also

DataFrame.merge

For column(s)-on-column(s) operations.

Notes

Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.

Support for specifying index levels as the on parameter was added in version 0.23.0.

Examples

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3', 'K4', 'K5'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K2  A2
3  K3  A3
4  K4  A4
5  K5  A5
>>> other = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
...                       'B': ['B0', 'B1', 'B2']})
>>> other
  key   B
0  K0  B0
1  K1  B1
2  K2  B2

Join DataFrames using their indexes.

>>> df.join(other, lsuffix='_caller', rsuffix='_other')
  key_caller   A key_other    B
0         K0  A0        K0   B0
1         K1  A1        K1   B1
2         K2  A2        K2   B2
3         K3  A3       NaN  NaN
4         K4  A4       NaN  NaN
5         K5  A5       NaN  NaN

If we want to join using the key columns, we need to set key to be the index in both df and other. The joined DataFrame will have key as its index.

>>> df.set_index('key').join(other.set_index('key'))
      A    B
key
K0   A0   B0
K1   A1   B1
K2   A2   B2
K3   A3  NaN
K4   A4  NaN
K5   A5  NaN

Another option to join using the key columns is to use the on parameter. DataFrame.join always uses other’s index but we can use any column in df. This method preserves the original DataFrame’s index in the result.

>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN

Using non-unique key values shows how they are matched.

>>> df = pd.DataFrame({'key': ['K0', 'K1', 'K1', 'K3', 'K0', 'K1'],
...                    'A': ['A0', 'A1', 'A2', 'A3', 'A4', 'A5']})
>>> df
  key   A
0  K0  A0
1  K1  A1
2  K1  A2
3  K3  A3
4  K0  A4
5  K1  A5
>>> df.join(other.set_index('key'), on='key', validate='m:1')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K1  A2   B1
3  K3  A3  NaN
4  K0  A4   B0
5  K1  A5   B1
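
The how='cross' option is not demonstrated above; a minimal sketch with two small illustrative frames:

>>> left = pd.DataFrame({'a': [1, 2]})
>>> right = pd.DataFrame({'b': ['x', 'y']})
>>> left.join(right, how='cross')
   a  b
0  1  x
1  1  y
2  2  x
3  2  y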
kurt(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar
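
Examples

kurt ships without an example in this listing; a minimal illustration (values chosen so the results are exact):

>>> df = pd.DataFrame({'a': [1, 2, 2, 3], 'b': [3, 4, 4, 4]})
>>> df.kurt()
a    1.5
b    4.0
dtype: float64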

kurtosis(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

le(other, axis='columns', level=None)

Get Less than or equal to of dataframe and other, element-wise (binary operator le).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
lt(other, axis='columns', level=None)

Get Less than of dataframe and other, element-wise (binary operator lt).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
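
None of the shared examples above calls lt directly; as a quick sketch against a scalar on the same df:

>>> df.lt(200)
    cost  revenue
A  False     True
B   True    False
C   True    False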
max(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the maximum of the values over the requested axis.

If you want the index of the maximum, use idxmax. This is the equivalent of the numpy.ndarray method argmax.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()
8
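
The example above uses a Series; on a DataFrame the reduction runs per column by default, or per row with axis=1 (a small illustration):

>>> df = pd.DataFrame({'a': [1, 4], 'b': [2, 3]})
>>> df.max()
a    4
b    3
dtype: int64
>>> df.max(axis=1)
0    2
1    4
dtype: int64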
mean(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the mean of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar
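
Examples

mean ships without an example in this listing; a minimal illustration, per column by default and per row with axis=1:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.mean()
a    1.5
b    3.5
dtype: float64
>>> df.mean(axis=1)
0    2.0
1    3.0
dtype: float64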

median(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the median of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar
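
Examples

Likewise, a minimal illustration for median:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
>>> df.median()
a    2.0
b    5.0
dtype: float64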

min(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the minimum of the values over the requested axis.

If you want the index of the minimum, use idxmin. This is the equivalent of the numpy.ndarray method argmin.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()
0
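
As with max, on a DataFrame the reduction runs per column by default (a small illustration):

>>> df = pd.DataFrame({'a': [1, 4], 'b': [2, 3]})
>>> df.min()
a    1
b    2
dtype: int64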
mod(other, axis='columns', level=None, fill_value=None)

Get Modulo of dataframe and other, element-wise (binary operator mod).

Equivalent to dataframe % other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmod.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, and again with the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, matching keys by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
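
The shared examples above never invoke mod itself; as a quick sketch on the same df:

>>> df.mod(3)
           angles  degrees
circle          0        0
triangle        0        0
rectangle       1        0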
mul(other, axis='columns', level=None, fill_value=None)

Get Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, and again with the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, matching keys by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
multiply(other, axis='columns', level=None, fill_value=None)

Get Multiplication of dataframe and other, element-wise (binary operator mul).

Equivalent to dataframe * other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rmul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, and again with the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, matching keys by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
ne(other, axis='columns', level=None)

Get Not equal to of dataframe and other, element-wise (binary operator ne).

Among flexible wrappers (eq, ne, le, lt, ge, gt) to comparison operators.

Equivalent to ==, !=, <=, <, >=, > with support to choose axis (rows or columns) and level for comparison.

Parameters:
  • other (scalar, sequence, Series, or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}, default 'columns') – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’).

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

Returns:

Result of the comparison.

Return type:

DataFrame of bool

See also

DataFrame.eq

Compare DataFrames for equality elementwise.

DataFrame.ne

Compare DataFrames for inequality elementwise.

DataFrame.le

Compare DataFrames for less than inequality or equality elementwise.

DataFrame.lt

Compare DataFrames for strictly less than inequality elementwise.

DataFrame.ge

Compare DataFrames for greater than inequality or equality elementwise.

DataFrame.gt

Compare DataFrames for strictly greater than inequality elementwise.

Notes

Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).

Examples

>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> df
   cost  revenue
A   250      100
B   150      250
C   100      300

Comparison with a scalar, using either the operator or method:

>>> df == 100
    cost  revenue
A  False     True
B  False    False
C   True    False
>>> df.eq(100)
    cost  revenue
A  False     True
B  False    False
C   True    False

When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:

>>> df != pd.Series([100, 250], index=["cost", "revenue"])
    cost  revenue
A   True     True
B   True    False
C  False     True

Use the method to control the broadcast axis:

>>> df.ne(pd.Series([100, 300], index=["A", "D"]), axis='index')
   cost  revenue
A  True    False
B  True     True
C  True     True
D  True     True

When comparing to an arbitrary sequence, the number of columns must match the number of elements in other:

>>> df == [250, 100]
    cost  revenue
A   True     True
B  False    False
C  False    False

Use the method to control the axis:

>>> df.eq([250, 250, 100], axis='index')
    cost  revenue
A   True    False
B  False     True
C   True    False

Compare to a DataFrame of a different shape.

>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])
>>> other
   revenue
A      300
B      250
C      100
D      150
>>> df.gt(other)
    cost  revenue
A  False    False
B  False    False
C  False     True
D  False    False

Compare to a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'cost': [250, 150, 100, 150, 300, 220],
...                              'revenue': [100, 250, 300, 200, 175, 225]},
...                             index=[['Q1', 'Q1', 'Q1', 'Q2', 'Q2', 'Q2'],
...                                    ['A', 'B', 'C', 'A', 'B', 'C']])
>>> df_multindex
      cost  revenue
Q1 A   250      100
   B   150      250
   C   100      300
Q2 A   150      200
   B   300      175
   C   220      225
>>> df.le(df_multindex, level=1)
       cost  revenue
Q1 A   True     True
   B   True     True
   C   True     True
Q2 A  False     True
   B   True    False
   C   True    False
pow(other, axis='columns', level=None, fill_value=None)

Get Exponential power of dataframe and other, element-wise (binary operator pow).

Equivalent to dataframe ** other, but with support to substitute a fill_value for missing data in one of the inputs. With reverse version, rpow.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar with the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, and again with the reverse version.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis with the operator version.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, matching keys by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
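
The shared examples above never invoke pow itself; as a quick sketch on the same df:

>>> df.pow(2)
           angles  degrees
circle          0   129600
triangle        9    32400
rectangle      16   129600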
prod(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
product(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
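
product is an alias of prod and accepts the same arguments; a minimal check:

>>> pd.Series([2, 3, 4]).product()
24
>>> pd.Series([2, 3, 4]).prod()
24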
radd(other, axis='columns', level=None, fill_value=None)

Get Addition of dataframe and other, element-wise (binary operator radd).

Equivalent to other + dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, add.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
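
The shared examples above never call radd itself; as a small supplement (df as defined above), radd with a scalar matches the other + dataframe form and agrees with df.add(1):

>>> df.radd(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361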
rdiv(other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
rfloordiv(other, axis='columns', level=None, fill_value=None)

Get Integer division of dataframe and other, element-wise (binary operator rfloordiv).

Equivalent to other // dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, floordiv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
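
rfloordiv itself does not appear in the shared examples. A hedged sketch (df as defined above); note that pandas, unlike NumPy, yields inf when a positive value is floor-divided by zero, so the angles column is upcast to float while degrees stays integer:

>>> df.rfloordiv(10)
           angles  degrees
circle        inf        0
triangle      3.0        0
rectangle     2.0        0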
rmod(other, axis='columns', level=None, fill_value=None)

Get Modulo of dataframe and other, element-wise (binary operator rmod).

Equivalent to other % dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, mod.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
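
rmod is likewise absent from the shared examples; a minimal Series sketch of other % series:

>>> pd.Series([3, 4, 6]).rmod(10)
0    1
1    2
2    4
dtype: int64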
rmul(other, axis='columns', level=None, fill_value=None)

Get Multiplication of dataframe and other, element-wise (binary operator rmul).

Equivalent to other * dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, mul.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
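
As a small supplement (df as defined above), rmul with a scalar computes other * dataframe and agrees with df.mul(2):

>>> df.rmul(2)
           angles  degrees
circle          0      720
triangle        6      360
rectangle       8      720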
rpow(other, axis='columns', level=None, fill_value=None)

Get Exponential power of dataframe and other, element-wise (binary operator rpow).

Equivalent to other ** dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, pow.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
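
Since rpow does not appear in the shared examples, a minimal Series sketch of other ** series (kept small to avoid integer overflow):

>>> pd.Series([2, 3, 4]).rpow(10)
0      100
1     1000
2    10000
dtype: int64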
rsub(other, axis='columns', level=None, fill_value=None)

Get Subtraction of dataframe and other, element-wise (binary operator rsub).

Equivalent to other - dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, sub.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
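
As a small supplement (df as defined above), rsub with a scalar computes other - dataframe:

>>> df.rsub(1)
           angles  degrees
circle          1     -359
triangle       -2     -179
rectangle      -3     -359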
rtruediv(other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator rtruediv).

Equivalent to other / dataframe, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, truediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
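
rdiv is an alias of rtruediv, so calling rtruediv directly reproduces the df.rdiv(10) output already shown above:

>>> df.rtruediv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778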
sem(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

Series or scalar
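
Examples

The original docstring ships without examples; the following is a hedged sketch. sem equals the standard deviation divided by the square root of N, shown here with the same person_id frame used in the std examples below:

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df.sem()
age       9.393038
height    0.118708
dtype: float64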

skew(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar
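
Examples

The original docstring provides no examples; as a minimal sketch, perfectly symmetric data has zero skew, which makes the result easy to verify by hand:

>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 6]})
>>> df.skew()
a    0.0
b    0.0
dtype: float64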

std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

Series or scalar

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
sub(other, axis='columns', level=None, fill_value=None)

Get Subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, rsub.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
subtract(other, axis='columns', level=None, fill_value=None)

Get Subtraction of dataframe and other, element-wise (binary operator sub).

Equivalent to dataframe - other, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, rsub.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape, with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, by level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.

Parameters:
  • axis ({index (0), columns (1)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

Series or scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.sum()
14

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
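
As a brief supplement (the frame is illustrative, not from the original docstring), a DataFrame sums down columns by default, while axis=1 gives row totals:

>>> df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
>>> df.sum()
a    3
b    7
dtype: int64
>>> df.sum(axis=1)
0    4
1    6
dtype: int64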
truediv(other, axis='columns', level=None, fill_value=None)

Get Floating division of dataframe and other, element-wise (binary operator truediv).

Equivalent to dataframe / other, but with support for substituting a fill_value for missing data in one of the inputs. With reverse version, rtruediv.

Among flexible wrappers (add, sub, mul, div, mod, pow) to arithmetic operators: +, -, *, /, //, %, **.

Parameters:
  • other (scalar, sequence, Series, dict or DataFrame) – Any single or multiple element data structure, or list-like object.

  • axis ({0 or 'index', 1 or 'columns'}) – Whether to compare by the index (0 or ‘index’) or columns (1 or ‘columns’). For Series input, axis to match Series index on.

  • level (int or label) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (float or None, default None) – Fill existing missing (NaN) values, and any new element needed for successful DataFrame alignment, with this value before computation. If data in both corresponding DataFrame locations is missing the result will be missing.

Returns:

Result of the arithmetic operation.

Return type:

DataFrame

See also

DataFrame.add

Add DataFrames.

DataFrame.sub

Subtract DataFrames.

DataFrame.mul

Multiply DataFrames.

DataFrame.div

Divide DataFrames (float division).

DataFrame.truediv

Divide DataFrames (float division).

DataFrame.floordiv

Divide DataFrames (integer division).

DataFrame.mod

Calculate modulo (remainder after division).

DataFrame.pow

Calculate exponential power.

Notes

Mismatched indices will be unioned together.

Examples

>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> df
           angles  degrees
circle          0      360
triangle        3      180
rectangle       4      360

Add a scalar using the operator version, which returns the same results.

>>> df + 1
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361
>>> df.add(1)
           angles  degrees
circle          1      361
triangle        4      181
rectangle       5      361

Divide by a constant, with the reverse version shown as well.

>>> df.div(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.rdiv(10)
             angles   degrees
circle          inf  0.027778
triangle   3.333333  0.055556
rectangle  2.500000  0.027778

Subtract a list and a Series by axis, with the operator version shown first.

>>> df - [1, 2]
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub([1, 2], axis='columns')
           angles  degrees
circle         -1      358
triangle        2      178
rectangle       3      358
>>> df.sub(pd.Series([1, 1, 1], index=['circle', 'triangle', 'rectangle']),
...        axis='index')
           angles  degrees
circle         -1      359
triangle        2      179
rectangle       3      359

Multiply by a dictionary, by axis.

>>> df.mul({'angles': 0, 'degrees': 2})
            angles  degrees
circle           0      720
triangle         0      360
rectangle        0      720
>>> df.mul({'circle': 0, 'triangle': 2, 'rectangle': 3}, axis='index')
            angles  degrees
circle           0        0
triangle         6      360
rectangle       12     1080

Multiply by a DataFrame of a different shape with the operator version.

>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])
>>> other
           angles
circle          0
triangle        3
rectangle       4
>>> df * other
           angles  degrees
circle          0      NaN
triangle        9      NaN
rectangle      16      NaN
>>> df.mul(other, fill_value=0)
           angles  degrees
circle          0      0.0
triangle        9      0.0
rectangle      16      0.0

Divide by a DataFrame with a MultiIndex, matching on a level.

>>> df_multindex = pd.DataFrame({'angles': [0, 3, 4, 4, 5, 6],
...                              'degrees': [360, 180, 360, 360, 540, 720]},
...                             index=[['A', 'A', 'A', 'B', 'B', 'B'],
...                                    ['circle', 'triangle', 'rectangle',
...                                     'square', 'pentagon', 'hexagon']])
>>> df_multindex
             angles  degrees
A circle          0      360
  triangle        3      180
  rectangle       4      360
B square          4      360
  pentagon        5      540
  hexagon         6      720
>>> df.div(df_multindex, level=1, fill_value=0)
             angles  degrees
A circle        NaN      1.0
  triangle      1.0      1.0
  rectangle     1.0      1.0
B square        0.0      0.0
  pentagon      0.0      0.0
  hexagon       0.0      0.0
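Because truediv is the method form of the / operator, it can be called the same way as div above; a brief, hedged sketch reusing df and other from the examples above (fill_value substitutes for elements missing from one input before dividing):

>>> df.truediv(10)
           angles  degrees
circle        0.0     36.0
triangle      0.3     18.0
rectangle     0.4     36.0
>>> df.truediv(other, fill_value=1)
           angles  degrees
circle        NaN    360.0
triangle      1.0    180.0
rectangle     1.0    360.0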
var(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0), columns (1)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

Series or DataFrame (if level specified)

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01
>>> df.var()
age       352.916667
height      0.056367
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)
age       264.687500
height      0.042275
dtype: float64
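A short, hedged sketch of the skipna behavior (values here are illustrative): with the default skipna=True, NA values are dropped before the variance is computed, while skipna=False propagates them:

>>> s = pd.Series([1.0, np.nan, 3.0])
>>> s.var()
2.0
>>> s.var(skipna=False)
nan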
merge(right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)[source]

Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

Parameters:
  • right (DataFrame or named Series) – Object to merge with.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –

    Type of merge to be performed.

    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

      New in version 1.2.0.

  • on (label or list) – Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

  • left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

  • right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

  • left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

  • right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index.

  • sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

  • suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

  • copy (bool, default True) – If False, avoid copy if possible.

  • indicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.

  • validate (str, optional) –

    If specified, checks if merge is of specified type.

    • “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

    • “one_to_many” or “1:m”: check if merge keys are unique in left dataset.

    • “many_to_one” or “m:1”: check if merge keys are unique in right dataset.

    • “many_to_many” or “m:m”: allowed, but does not result in checks.

Returns:

A DataFrame of the two merged objects.

Return type:

DataFrame

See also

merge_ordered

Merge with optional filling/interpolation.

merge_asof

Merge on nearest keys.

DataFrame.join

Similar method using indices.

Notes

Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.

Examples

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1
  lkey  value
0  foo      1
1  bar      2
2  baz      3
3  foo      5
>>> df2
  rkey  value
0  foo      5
1  bar      6
2  baz      7
3  foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
    Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1
     a  b
0  foo  1
1  bar  2
>>> df2
     a  c
0  foo  3
1  baz  4
>>> df1.merge(df2, how='inner', on='a')
     a  b  c
0  foo  1  3
>>> df1.merge(df2, how='left', on='a')
     a  b    c
0  foo  1  3.0
1  bar  2  NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
>>> df2 = pd.DataFrame({'right': [7, 8]})
>>> df1
  left
0  foo
1  bar
>>> df2
   right
0      7
1      8
>>> df1.merge(df2, how='cross')
   left  right
0   foo      7
1   foo      8
2   bar      7
3   bar      8
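As a hedged illustration of indicator (the frames are re-created here for the sketch), indicator=True appends a _merge column recording whether each row's key was found in the left frame, the right frame, or both:

>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1.merge(df2, how='left', on='a', indicator=True)
     a  b    c     _merge
0  foo  1  3.0       both
1  bar  2  NaN  left_only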
round(decimals=0, *args, **kwargs)[source]

Round a DataFrame to a variable number of decimal places.

Parameters:
  • decimals (int, dict, Series) – Number of decimal places to round each column to. If an int is given, round each column to the same number of places. Otherwise dict and Series round to variable numbers of places. Column names should be in the keys if decimals is a dict-like, or in the index if decimals is a Series. Any columns not included in decimals will be left as is. Elements of decimals which are not columns of the input will be ignored.

  • *args – Additional keywords have no effect but might be accepted for compatibility with numpy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:

A DataFrame with the affected columns rounded to the specified number of decimal places.

Return type:

DataFrame

See also

numpy.around

Round a numpy array to the given number of decimals.

Series.round

Round a Series to the given number of decimals.

Examples

>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...                   columns=['dogs', 'cats'])
>>> df
    dogs  cats
0  0.21  0.32
1  0.01  0.67
2  0.66  0.03
3  0.21  0.18

By providing an integer, each column is rounded to the same number of decimal places.

>>> df.round(1)
    dogs  cats
0   0.2   0.3
1   0.0   0.7
2   0.7   0.0
3   0.2   0.2

With a dict, the number of places for specific columns can be specified with the column names as keys and the number of decimal places as values.

>>> df.round({'dogs': 1, 'cats': 0})
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0

Using a Series, the number of places for specific columns can be specified with the column names as index and the number of decimal places as values.

>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
    dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0
corr(method='pearson', min_periods=1, numeric_only=False)[source]

Compute pairwise correlation of columns, excluding NA/null values.

Parameters:
  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method of correlation:

    • pearson : standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: callable with input two 1d ndarrays and returning a float. Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result. Currently only available for Pearson and Spearman correlation.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    New in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

Correlation matrix.

Return type:

DataFrame

See also

DataFrame.corrwith

Compute pairwise correlation with another DataFrame or Series.

Series.corr

Compute the correlation between two Series.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0
>>> df = pd.DataFrame([(1, 1), (2, np.nan), (np.nan, 3), (4, 4)],
...                   columns=['dogs', 'cats'])
>>> df.corr(min_periods=3)
      dogs  cats
dogs   1.0   NaN
cats   NaN   1.0
cov(min_periods=None, ddof=1, numeric_only=False)[source]

Compute pairwise covariance of columns, excluding NA/null values.

Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance matrix of the columns of the DataFrame.

Both NA and null values are automatically excluded from the calculation. (See the note below about bias from missing values.) A threshold can be set for the minimum number of observations for each value created. Comparisons with observations below this threshold will be returned as NaN.

This method is generally used for the analysis of time series data to understand the relationship between different measures across time.

Parameters:
  • min_periods (int, optional) – Minimum number of observations required per pair of columns to have a valid result.

  • ddof (int, default 1) –

    Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

    New in version 1.1.0.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    New in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

The covariance matrix of the series of the DataFrame.

Return type:

DataFrame

See also

Series.cov

Compute covariance with another Series.

core.window.ewm.ExponentialMovingWindow.cov

Exponential weighted sample covariance.

core.window.expanding.Expanding.cov

Expanding sample covariance.

core.window.rolling.Rolling.cov

Rolling sample covariance.

Notes

Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.

For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.

However, for many applications this estimate may not be acceptable because the estimate covariance matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more details.

Examples

>>> df = pd.DataFrame([(1, 2), (0, 3), (2, 0), (1, 1)],
...                   columns=['dogs', 'cats'])
>>> df.cov()
          dogs      cats
dogs  0.666667 -1.000000
cats -1.000000  1.666667
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
...                   columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
          a         b         c         d         e
a  0.998438 -0.020161  0.059277 -0.008943  0.014144
b -0.020161  1.059352 -0.008543 -0.024738  0.009826
c  0.059277 -0.008543  1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486  0.921297 -0.013692
e  0.014144  0.009826 -0.000271 -0.013692  0.977795

Minimum number of periods

This method also supports an optional min_periods keyword that specifies the required minimum number of non-NA observations for each column pair in order to have a valid result:

>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
...                   columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
          a         b         c
a  0.316741       NaN -0.150812
b       NaN  1.248003  0.191417
c -0.150812  0.191417  0.895202
corrwith(other, axis=0, drop=False, method='pearson', numeric_only=False)[source]

Compute pairwise correlation.

Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or DataFrame. DataFrames are first aligned along both axes before computing the correlations.

Parameters:
  • other (DataFrame, Series) – Object with which to compute correlations.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ to compute row-wise, 1 or ‘columns’ for column-wise.

  • drop (bool, default False) – Drop missing indices from result.

  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method of correlation:

    • pearson : standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: callable with input two 1d ndarrays and returning a float.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    New in version 1.5.0.

    Changed in version 2.0.0: The default value of numeric_only is now False.

Returns:

Pairwise correlations.

Return type:

Series

See also

DataFrame.corr

Compute pairwise correlation of columns.

Examples

>>> index = ["a", "b", "c", "d", "e"]
>>> columns = ["one", "two", "three", "four"]
>>> df1 = pd.DataFrame(np.arange(20).reshape(5, 4), index=index, columns=columns)
>>> df2 = pd.DataFrame(np.arange(16).reshape(4, 4), index=index[:4], columns=columns)
>>> df1.corrwith(df2)
one      1.0
two      1.0
three    1.0
four     1.0
dtype: float64
>>> df2.corrwith(df1, axis=1)
a    1.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
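A hedged sketch of the drop parameter: with drop=True, labels that have no overlapping data (here 'e', present only in df1) are removed from the result instead of being reported as NaN:

>>> df2.corrwith(df1, axis=1, drop=True)
a    1.0
b    1.0
c    1.0
d    1.0
dtype: float64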
count(axis=0, numeric_only=False)[source]

Count non-NA cells for each column or row.

The values None, NaN, NaT, and optionally numpy.inf (depending on pandas.options.mode.use_inf_as_na) are considered NA.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – If 0 or ‘index’ counts are generated for each column. If 1 or ‘columns’ counts are generated for each row.

  • numeric_only (bool, default False) – Include only float, int or boolean data.

Returns:

For each column/row the number of non-NA/null entries. If level is specified returns a DataFrame.

Return type:

Series or DataFrame

See also

Series.count

Number of non-NA elements in a Series.

DataFrame.value_counts

Count unique combinations of columns.

DataFrame.shape

Number of DataFrame rows and columns (including NA elements).

DataFrame.isna

Boolean same-sized DataFrame showing places of NA elements.

Examples

Constructing DataFrame from a dictionary:

>>> df = pd.DataFrame({"Person":
...                    ["John", "Myla", "Lewis", "John", "Myla"],
...                    "Age": [24., np.nan, 21., 33, 26],
...                    "Single": [False, True, True, True, False]})
>>> df
   Person   Age  Single
0    John  24.0   False
1    Myla   NaN    True
2   Lewis  21.0    True
3    John  33.0    True
4    Myla  26.0   False

Notice the uncounted NA values:

>>> df.count()
Person    5
Age       4
Single    5
dtype: int64

Counts for each row:

>>> df.count(axis='columns')
0    3
1    2
2    3
3    3
4    3
dtype: int64
nunique(axis=0, dropna=True)[source]

Count number of distinct elements in specified axis.

Return Series with number of distinct elements. Can ignore NaN values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • dropna (bool, default True) – Don’t include NaN in the counts.

Return type:

Series

See also

Series.nunique

Method nunique for Series.

DataFrame.count

Count non-NA cells for each column or row.

Examples

>>> df = pd.DataFrame({'A': [4, 5, 6], 'B': [4, 1, 1]})
>>> df.nunique()
A    3
B    2
dtype: int64
>>> df.nunique(axis=1)
0    1
1    2
2    2
dtype: int64
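A brief, hedged sketch of dropna (illustrative data): with dropna=False, NaN counts as its own distinct value:

>>> df = pd.DataFrame({'A': [4, 5, np.nan]})
>>> df.nunique()
A    2
dtype: int64
>>> df.nunique(dropna=False)
A    3
dtype: int64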
idxmin(axis=0, skipna=True, numeric_only=False)[source]

Return index of first occurrence of minimum over requested axis.

NA/null values are excluded.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    New in version 1.5.0.

Returns:

Indexes of minima along the specified axis.

Return type:

Series

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmin

Return index of the minimum element.

Notes

This method is the DataFrame version of ndarray.argmin.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                     'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the minimum value in each column.

>>> df.idxmin()
consumption                Pork
co2_emissions    Wheat Products
dtype: object

To return the index for the minimum value in each row, use axis="columns".

>>> df.idxmin(axis="columns")
Pork                consumption
Wheat Products    co2_emissions
Beef                consumption
dtype: object
idxmax(axis=0, skipna=True, numeric_only=False)[source]

Return index of first occurrence of maximum over requested axis.

NA/null values are excluded.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to use. 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    New in version 1.5.0.

Returns:

Indexes of maxima along the specified axis.

Return type:

Series

Raises:

ValueError

  • If the row/column is empty

See also

Series.idxmax

Return index of the maximum element.

Notes

This method is the DataFrame version of ndarray.argmax.

Examples

Consider a dataset containing food consumption in Argentina.

>>> df = pd.DataFrame({'consumption': [10.51, 103.11, 55.48],
...                     'co2_emissions': [37.2, 19.66, 1712]},
...                   index=['Pork', 'Wheat Products', 'Beef'])
>>> df
                consumption  co2_emissions
Pork                  10.51         37.20
Wheat Products       103.11         19.66
Beef                  55.48       1712.00

By default, it returns the index for the maximum value in each column.

>>> df.idxmax()
consumption     Wheat Products
co2_emissions             Beef
dtype: object

To return the index for the maximum value in each row, use axis="columns".

>>> df.idxmax(axis="columns")
Pork              co2_emissions
Wheat Products     consumption
Beef              co2_emissions
dtype: object
mode(axis=0, numeric_only=False, dropna=True)[source]

Get the mode(s) of each element along the selected axis.

The mode of a set of values is the value that appears most often. It can be multiple values.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) –

    The axis to iterate over while searching for the mode:

    • 0 or ‘index’ : get mode of each column

    • 1 or ‘columns’ : get mode of each row.

  • numeric_only (bool, default False) – If True, only apply to numeric columns.

  • dropna (bool, default True) – Don’t consider counts of NaN/NaT.

Returns:

The modes of each column or row.

Return type:

DataFrame

See also

Series.mode

Return the highest frequency value in a Series.

Series.value_counts

Return the counts of values in a Series.

Examples

>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))
>>> df
           species  legs  wings
falcon        bird     2    2.0
horse       mammal     4    NaN
spider   arthropod     8    0.0
ostrich       bird     2    NaN

By default, missing values are not considered, and the modes of wings are both 0.0 and 2.0. Because the resulting DataFrame has two rows, the second row of species and legs contains NaN.

>>> df.mode()
  species  legs  wings
0    bird   2.0    0.0
1     NaN   NaN    2.0

Setting dropna=False, NaN values are considered, and they can be the mode (as for wings).

>>> df.mode(dropna=False)
  species  legs  wings
0    bird     2    NaN

Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other types are ignored.

>>> df.mode(numeric_only=True)
   legs  wings
0   2.0    0.0
1   NaN    2.0

To compute the mode over columns and not rows, use the axis parameter:

>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN
quantile(q: float = 0.5, axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → Series[source]
quantile(q: ExtensionArray | ndarray | Index | Series | Sequence[float], axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → Series | DataFrame
quantile(q: float | ExtensionArray | ndarray | Index | Series | Sequence[float] = 0.5, axis: int | Literal['index', 'columns', 'rows'] = 0, numeric_only: bool = False, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → Series | DataFrame

Return values at the given quantile over requested axis.

Parameters:
  • q (float or array-like, default 0.5 (50% quantile)) – Value(s) between 0 and 1, the quantile(s) to compute.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Equals 0 or ‘index’ for row-wise, 1 or ‘columns’ for column-wise.

  • numeric_only (bool, default False) –

    Include only float, int or boolean data.

    Changed in version 2.0.0: The default value of numeric_only is now False.

  • interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –

    This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

    • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

    • lower: i.

    • higher: j.

    • nearest: i or j whichever is nearest.

    • midpoint: (i + j) / 2.

  • method ({'single', 'table'}, default 'single') – Whether to compute quantiles per-column (‘single’) or over all columns (‘table’). When ‘table’, the only allowed interpolation methods are ‘nearest’, ‘lower’, and ‘higher’.

Returns:

If q is an array, a DataFrame will be returned where the index is q, the columns are the columns of self, and the values are the quantiles.

If q is a float, a Series will be returned where the index is the columns of self and the values are the quantiles.

Return type:

Series or DataFrame

See also

core.window.rolling.Rolling.quantile

Rolling quantile.

numpy.percentile

Numpy function to compute the percentile.

Examples

>>> df = pd.DataFrame(np.array([[1, 1], [2, 10], [3, 100], [4, 100]]),
...                   columns=['a', 'b'])
>>> df.quantile(.1)
a    1.3
b    3.7
Name: 0.1, dtype: float64
>>> df.quantile([.1, .5])
       a     b
0.1  1.3   3.7
0.5  2.5  55.0

Specifying method=’table’ will compute the quantile over all columns.

>>> df.quantile(.1, method="table", interpolation="nearest")
a    1
b    1
Name: 0.1, dtype: int64
>>> df.quantile([.1, .5], method="table", interpolation="nearest")
     a    b
0.1  1    1
0.5  3  100

Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.

>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object
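A hedged sketch of the interpolation options (illustrative data): 'lower' and 'higher' pick an existing data point on either side of the desired quantile, so the original dtype is preserved:

>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})
>>> df.quantile(.1, interpolation='lower')
a    1
Name: 0.1, dtype: int64
>>> df.quantile(.1, interpolation='higher')
a    2
Name: 0.1, dtype: int64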
asfreq(freq, method=None, how=None, normalize=False, fill_value=None)[source]

Convert time series to specified frequency.

Returns the original data conformed to a new index with the specified frequency.

If the index of this DataFrame is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).

Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).

The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.

Parameters:
  • freq (DateOffset or str) – Frequency DateOffset or string.

  • method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) –

    Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):

    • ‘pad’ / ‘ffill’: propagate last valid observation forward to next valid

    • ‘backfill’ / ‘bfill’: use NEXT valid observation to fill.

  • how ({'start', 'end'}, default end) – For PeriodIndex only (see PeriodIndex.asfreq).

  • normalize (bool, default False) – Whether to reset output index to midnight.

  • fill_value (scalar, optional) – Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).

Returns:

DataFrame object reindexed to the specified frequency.

Return type:

DataFrame

See also

reindex

Conform DataFrame to new index with optional filling logic.

Notes

To learn more about the frequency strings, please see this link.

Examples

Start by creating a series with 4 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=4, freq='T')
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({'s': series})
>>> df
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:01:00    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:03:00    3.0

Upsample the series into 30 second bins.

>>> df.asfreq(freq='30S')
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    NaN
2000-01-01 00:03:00    3.0

Upsample again, providing a fill value.

>>> df.asfreq(freq='30S', fill_value=9.0)
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    9.0
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    9.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    9.0
2000-01-01 00:03:00    3.0

Upsample again, providing a method.

>>> df.asfreq(freq='30S', method='bfill')
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    2.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    3.0
2000-01-01 00:03:00    3.0
resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None, group_keys=False)[source]

Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Parameters:
  • rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.

  • closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

  • label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

  • convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or end of rule.

  • kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.

  • on (str, optional) – For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

  • level (str or int, optional) – For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

  • origin (Timestamp or str, default 'start_day') –

    The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:

    • ‘epoch’: origin is 1970-01-01

    • ‘start’: origin is the first value of the timeseries

    • ‘start_day’: origin is the first day at midnight of the timeseries

    New in version 1.1.0.

    • ‘end’: origin is the last value of the timeseries

    • ‘end_day’: origin is the ceiling midnight of the last day

    New in version 1.3.0.

  • offset (Timedelta or str, default is None) –

    An offset timedelta added to the origin.

    New in version 1.1.0.

  • group_keys (bool, default False) –

    Whether to include the group keys in the result index when using .apply() on the resampled object.

    New in version 1.5.0: Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples).

    Changed in version 2.0.0: group_keys now defaults to False.

Returns:

Resampler object.

Return type:

pandas.core.Resampler

See also

Series.resample

Resample a Series.

DataFrame.resample

Resample a DataFrame.

groupby

Group DataFrame by mapping, function, label, or list of labels.

asfreq

Reindex a DataFrame with the given frequency without grouping.

Notes

See the user guide for more.

To learn more about the offset strings, please see this link.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Please note that the value in the bucket used as the label is not included in the bucket it labels. For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample('30S').ffill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(arraylike):
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
...                                             freq='A',
...                                             periods=2))
>>> s
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
...                                                   freq='Q',
...                                                   periods=4))
>>> q
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)
>>> df['week_starting'] = pd.date_range('01/01/2018',
...                                     periods=8,
...                                     freq='W')
>>> df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(
...     d2,
...     index=pd.MultiIndex.from_product(
...         [days, ['morning', 'afternoon']]
...     )
... )
>>> df2
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum()
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum()
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.resample('17min', origin='start').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64

If you want to take the largest Timestamp as the end of the bins:

>>> ts.resample('17min', origin='end').sum()
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17T, dtype: int64

In contrast with the start_day, you can use end_day to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:

>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64
to_timestamp(freq=None, how='start', axis=0, copy=None)[source]

Cast to DatetimeIndex of timestamps, at beginning of period.

Parameters:
  • freq (str, default frequency of PeriodIndex) – Desired frequency.

  • how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).

  • copy (bool, default True) – If False then underlying input data is not copied.

Returns:

The DataFrame has a DatetimeIndex.

Return type:

DataFrame

Examples

>>> idx = pd.PeriodIndex(['2023', '2024'], freq='Y')
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df1 = pd.DataFrame(data=d, index=idx)
>>> df1
      col1   col2
2023     1      3
2024     2      4

The resulting timestamps will be at the beginning of the year in this case.

>>> df1 = df1.to_timestamp()
>>> df1
            col1   col2
2023-01-01     1      3
2024-01-01     2      4
>>> df1.index
DatetimeIndex(['2023-01-01', '2024-01-01'], dtype='datetime64[ns]', freq=None)

Using freq, which is the offset that the resulting Timestamps will have:

>>> df2 = pd.DataFrame(data=d, index=idx)
>>> df2 = df2.to_timestamp(freq='M')
>>> df2
            col1   col2
2023-01-31     1      3
2024-01-31     2      4
>>> df2.index
DatetimeIndex(['2023-01-31', '2024-01-31'], dtype='datetime64[ns]', freq=None)
to_period(freq=None, axis=0, copy=None)[source]

Convert DataFrame from DatetimeIndex to PeriodIndex.

Convert DataFrame from DatetimeIndex to PeriodIndex with desired frequency (inferred from index if not passed).

Parameters:
  • freq (str, default) – Frequency of the PeriodIndex.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to convert (the index by default).

  • copy (bool, default True) – If False then underlying input data is not copied.

Returns:

The DataFrame has a PeriodIndex.

Return type:

DataFrame

Examples

>>> idx = pd.to_datetime(
...     [
...         "2001-03-31 00:00:00",
...         "2002-05-31 00:00:00",
...         "2003-08-31 00:00:00",
...     ]
... )
>>> idx
DatetimeIndex(['2001-03-31', '2002-05-31', '2003-08-31'],
dtype='datetime64[ns]', freq=None)
>>> idx.to_period("M")
PeriodIndex(['2001-03', '2002-05', '2003-08'], dtype='period[M]')

For the yearly frequency

>>> idx.to_period("Y")
PeriodIndex(['2001', '2002', '2003'], dtype='period[A-DEC]')
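Since the examples above call to_period on a DatetimeIndex directly, here is a minimal, hedged sketch applying it to a DataFrame (names are illustrative):

>>> dti = pd.date_range('2001-01-31', periods=3, freq='M')
>>> df = pd.DataFrame({'col': [1, 2, 3]}, index=dti)
>>> df.to_period('M').index
PeriodIndex(['2001-01', '2001-02', '2001-03'], dtype='period[M]')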
isin(values)[source]

Whether each element in the DataFrame is contained in values.

Parameters:

values (iterable, Series, DataFrame or dict) – The result will only be true at a location if all the labels match. If values is a Series, that’s the index. If values is a dict, the keys must be the column names, which must match. If values is a DataFrame, then both the index and column labels must match.

Returns:

DataFrame of booleans showing whether each element in the DataFrame is contained in values.

Return type:

DataFrame

See also

DataFrame.eq

Equality test for DataFrame.

Series.isin

Equivalent method on Series.

Series.str.contains

Test if pattern or regex is contained within a string of a Series or Index.

Examples

>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df
        num_legs  num_wings
falcon         2          2
dog            4          0

When values is a list check whether every value in the DataFrame is present in the list (which animals have 0 or 2 legs or wings)

>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True

To check if values is not in the DataFrame, use the ~ operator:

>>> ~df.isin([0, 2])
        num_legs  num_wings
falcon     False      False
dog         True      False

When values is a dict, we can pass values to check for each column separately:

>>> df.isin({'num_wings': [0, 3]})
        num_legs  num_wings
falcon     False      False
dog        False       True

When values is a Series or DataFrame the index and column must match. Note that ‘falcon’ does not match based on the number of legs in other.

>>> other = pd.DataFrame({'num_legs': [8, 3], 'num_wings': [0, 2]},
...                      index=['spider', 'falcon'])
>>> df.isin(other)
        num_legs  num_wings
falcon     False       True
dog        False      False
index

The index (row labels) of the DataFrame.

columns

The column labels of the DataFrame.

plot

alias of PlotAccessor

hist(column=None, by=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False, sharey=False, figsize=None, layout=None, bins=10, backend=None, legend=False, **kwargs)

Make a histogram of the DataFrame’s columns.

A histogram is a representation of the distribution of data. This function calls matplotlib.pyplot.hist() on each series in the DataFrame, resulting in one histogram per column.

Parameters:
  • data (DataFrame) – The pandas object holding the data.

  • column (str or sequence, optional) – If passed, will be used to limit data to a subset of columns.

  • by (object, optional) – If passed, then used to form histograms for separate groups.

  • grid (bool, default True) – Whether to show axis grid lines.

  • xlabelsize (int, default None) – If specified changes the x-axis label size.

  • xrot (float, default None) – Rotation of x axis labels. For example, a value of 90 displays the x labels rotated 90 degrees clockwise.

  • ylabelsize (int, default None) – If specified changes the y-axis label size.

  • yrot (float, default None) – Rotation of y axis labels. For example, a value of 90 displays the y labels rotated 90 degrees clockwise.

  • ax (Matplotlib axes object, default None) – The axes to plot the histogram on.

  • sharex (bool, default True if ax is None else False) – In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels for all subplots in a figure.

  • sharey (bool, default False) – In case subplots=True, share y axis and set some y axis labels to invisible.

  • figsize (tuple, optional) – The size in inches of the figure to create. Uses the value in matplotlib.rcParams by default.

  • layout (tuple, optional) – Tuple of (rows, columns) for the layout of the histograms.

  • bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.

  • backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

  • legend (bool, default False) –

    Whether to show the legend.

    New in version 1.1.0.

  • **kwargs – All other plotting keyword arguments to be passed to matplotlib.pyplot.hist().

Return type:

matplotlib.AxesSubplot or numpy.ndarray of them

See also

matplotlib.pyplot.hist

Plot a histogram using matplotlib.

Examples

This example draws a histogram based on the length and width of some animals, displayed in three bins.
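The code behind that example looks roughly like the following sketch (the rendered figure itself is not reproduced here):

>>> df = pd.DataFrame({
...     'length': [1.5, 0.5, 1.2, 0.9, 3],
...     'width': [0.7, 0.2, 0.15, 0.2, 1.1]},
...     index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)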

boxplot(column=None, by=None, ax=None, fontsize=None, rot=0, grid=True, figsize=None, layout=None, return_type=None, backend=None, **kwargs)

Make a box plot from DataFrame columns.

Make a box-and-whisker plot from DataFrame columns, optionally grouped by some other columns. A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from the edges of box to show the range of the data. By default, they extend no more than 1.5 * IQR (IQR = Q3 - Q1) from the edges of the box, ending at the farthest data point within that interval. Outliers are plotted as separate dots.

For further details see Wikipedia’s entry for boxplot.

Parameters:
  • column (str or list of str, optional) – Column name or list of names, or vector. Can be any valid input to pandas.DataFrame.groupby().

  • by (str or array-like, optional) – Column in the DataFrame to pandas.DataFrame.groupby(). One box plot will be drawn per value of the columns in by.

  • ax (object of class matplotlib.axes.Axes, optional) – The matplotlib axes to be used by boxplot.

  • fontsize (float or str) – Tick label font size in points or as a string (e.g., large).

  • rot (float, default 0) – The rotation angle of labels (in degrees) with respect to the screen coordinate system.

  • grid (bool, default True) – Setting this to True will show the grid.

  • figsize (A tuple (width, height) in inches) – The size of the figure to create in matplotlib.

  • layout (tuple (rows, columns), optional) – For example, (3, 5) will display the subplots using 3 rows and 5 columns, starting from the top-left.

  • return_type ({'axes', 'dict', 'both'} or None, default 'axes') –

    The kind of object to return. The default is axes.

    • ‘axes’ returns the matplotlib axes the boxplot is drawn on.

    • ‘dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.

    • ‘both’ returns a namedtuple with the axes and dict.

    • when grouping with by, a Series mapping columns to return_type is returned.

      If return_type is None, a NumPy array of axes with the same shape as layout is returned.

  • backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

  • **kwargs – All other plotting keyword arguments to be passed to matplotlib.pyplot.boxplot().

Returns:

See Notes.

Return type:

result

See also

pandas.Series.plot.hist

Make a histogram.

matplotlib.pyplot.boxplot

Matplotlib equivalent plot.

Notes

The return type depends on the return_type parameter:

  • ‘axes’ : object of class matplotlib.axes.Axes

  • ‘dict’ : dict of matplotlib.lines.Line2D objects

  • ‘both’ : a namedtuple with structure (ax, lines)

For data grouped with by, return a Series of the above or a numpy array:

  • Series

  • array (for return_type = None)

Use return_type='dict' when you want to tweak the appearance of the lines after plotting. In this case a dict containing the Lines making up the boxes, caps, fliers, medians, and whiskers is returned.

Examples

Boxplots can be created for every column in the dataframe by df.boxplot() or indicating the columns to be used:
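A hedged sketch of the basic call (illustrative data; the rendered figure is not reproduced here):

>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10, 4),
...                   columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])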

Boxplots of variables distributions grouped by the values of a third variable can be created using the option by. For instance:
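Roughly, and with illustrative data (this df is also the one assumed by the return_type examples below):

>>> df = pd.DataFrame(np.random.randn(10, 2),
...                   columns=['Col1', 'Col2'])
>>> df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
...                      'B', 'B', 'B', 'B', 'B'])
>>> boxplot = df.boxplot(by='X')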

A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by a combination of the variables in the x-axis.

The layout of the boxplot can be adjusted by giving a tuple to layout.

Additional formatting can be applied to the boxplot, like suppressing the grid (grid=False), rotating the labels in the x-axis (i.e. rot=45) or changing the fontsize (i.e. fontsize=15).

The parameter return_type can be used to select the type of element returned by boxplot. When return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:

>>> boxplot = df.boxplot(column=['Col1', 'Col2'], return_type='axes')
>>> type(boxplot)
<class 'matplotlib.axes._subplots.AxesSubplot'>

When grouping with by, a Series mapping columns to return_type is returned:

>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X',
...                      return_type='axes')
>>> type(boxplot)
<class 'pandas.core.series.Series'>

If return_type is None, a NumPy array of axes with the same shape as layout is returned:

>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X',
...                      return_type=None)
>>> type(boxplot)
<class 'numpy.ndarray'>
sparse

alias of SparseFrameAccessor

property values: ndarray

Return a Numpy representation of the DataFrame.

Warning

We recommend using DataFrame.to_numpy() instead.

Only the values in the DataFrame will be returned, the axes labels will be removed.

Returns:

The values of the DataFrame.

Return type:

numpy.ndarray

See also

DataFrame.to_numpy

Recommended alternative to this method.

DataFrame.index

Retrieve the index labels.

DataFrame.columns

Retrieving the column names.

Notes

The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes (even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if you are not dealing with the blocks.

e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8, dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64 will result in a float64 dtype.

Examples

A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.

>>> df = pd.DataFrame({'age':    [ 3,  29],
...                    'height': [94, 170],
...                    'weight': [31, 115]})
>>> df
   age  height  weight
0    3      94      31
1   29     170     115
>>> df.dtypes
age       int64
height    int64
weight    int64
dtype: object
>>> df.values
array([[  3,  94,  31],
       [ 29, 170, 115]])

A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of the broadest type that accommodates these mixed types (e.g., object).

>>> df2 = pd.DataFrame([('parrot',   24.0, 'second'),
...                     ('lion',     80.5, 1),
...                     ('monkey', np.nan, None)],
...                   columns=('name', 'max_speed', 'rank'))
>>> df2.dtypes
name          object
max_speed    float64
rank          object
dtype: object
>>> df2.values
array([['parrot', 24.0, 'second'],
       ['lion', 80.5, 1],
       ['monkey', nan, None]], dtype=object)
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) DataFrame[source]
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) DataFrame | None

Synonym for DataFrame.fillna() with method='ffill'.

Returns:

Object with missing values filled or None if inplace=True.

Return type:

Series/DataFrame or None
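
Examples

A minimal sketch (the single-column DataFrame here is constructed for illustration):

>>> df = pd.DataFrame({'A': [1.0, np.nan, 3.0]})
>>> df.ffill()
     A
0  1.0
1  1.0
2  3.0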

bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast=None) DataFrame[source]
bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast=None) None
bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast=None) DataFrame | None

Synonym for DataFrame.fillna() with method='bfill'.

Returns:

Object with missing values filled or None if inplace=True.

Return type:

Series/DataFrame or None
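
Examples

A minimal sketch, mirroring the ffill example above but filling backward:

>>> df = pd.DataFrame({'A': [np.nan, 2.0, np.nan]})
>>> df.bfill()
     A
0  2.0
1  2.0
2  NaN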

clip(lower=None, upper=None, *, axis=None, inplace=False, **kwargs)[source]

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array like, and in the latter case the clipping is performed element-wise in the specified axis.

Parameters:
  • lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

Return type:

Series or DataFrame or None

See also

Series.clip

Trim values at input threshold in series.

DataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.nan, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64
>>> df.clip(t, axis=0)
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
interpolate(method='linear', *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)[source]

Fill NaN values using an interpolation method.

Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.

Parameters:
  • method (str, default 'linear') –

    Interpolation technique to use. One of:

    • ‘linear’: Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.

    • ‘time’: Works on daily and higher resolution data to interpolate given length of interval.

    • ‘index’, ‘values’: use the actual numerical values of the index.

    • ‘pad’: Fill in NaNs using existing values.

    • ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘barycentric’, ‘polynomial’: Passed to scipy.interpolate.interp1d, whereas ‘spline’ is passed to scipy.interpolate.UnivariateSpline. These methods use the numerical values of the index. Both ‘polynomial’ and ‘spline’ require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5). Note that the ‘slinear’ method in pandas refers to the SciPy first-order spline, not pandas’ first-order spline.

    • ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’: Wrappers around the SciPy interpolation methods of similar names. See Notes.

    • ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the ‘piecewise_polynomial’ interpolation method in SciPy 0.18.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to interpolate along. For Series this parameter is unused and defaults to 0.

  • limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.

  • inplace (bool, default False) – Update the data in place if possible.

  • limit_direction ({'forward', 'backward', 'both'}, optional) –

    Consecutive NaNs will be filled in this direction.

    If limit is specified:
    • If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.

    • If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.

    If ‘limit’ is not specified:
    • If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’

    • else the default is ‘forward’

    Changed in version 1.1.0: Raises ValueError if limit_direction is ‘forward’ or ‘both’ and method is ‘backfill’ or ‘bfill’. Raises ValueError if limit_direction is ‘backward’ or ‘both’ and method is ‘pad’ or ‘ffill’.

  • limit_area ({None, ‘inside’, ‘outside’}, default None) –

    If limit is specified, consecutive NaNs will be filled with this restriction.

    • None: No fill restriction.

    • ‘inside’: Only fill NaNs surrounded by valid values (interpolate).

    • ‘outside’: Only fill NaNs outside valid values (extrapolate).

  • downcast (optional, 'infer' or None, defaults to None) – Downcast dtypes if possible.

  • **kwargs (optional) – Keyword arguments to pass on to the interpolating function.

Returns:

Returns the same object type as the caller, interpolated at some or all NaN values or None if inplace=True.

Return type:

Series or DataFrame or None

See also

fillna

Fill missing values using different methods.

scipy.interpolate.Akima1DInterpolator

Piecewise cubic polynomials (Akima interpolator).

scipy.interpolate.BPoly.from_derivatives

Piecewise polynomial in the Bernstein basis.

scipy.interpolate.interp1d

Interpolate a 1-D function.

scipy.interpolate.KroghInterpolator

Interpolate polynomial (Krogh interpolator).

scipy.interpolate.PchipInterpolator

PCHIP 1-d monotonic cubic interpolation.

scipy.interpolate.CubicSpline

Cubic spline data interpolator.

Notes

The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation.

Examples

Filling in NaN in a Series via linear interpolation.

>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.

>>> s = pd.Series([np.nan, "single_one", np.nan,
...                "fill_two_more", np.nan, np.nan, np.nan,
...                4.71, np.nan])
>>> s
0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6              NaN
7             4.71
8              NaN
dtype: object
>>> s.interpolate(method='pad', limit=2)
0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6              NaN
7             4.71
8             4.71
dtype: object

Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods require that you also specify an order (int).

>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

Fill the DataFrame forward (that is, going down) along each column using linear interpolation.

Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it to use for interpolation.

>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

Using polynomial interpolation.

>>> df['d'].interpolate(method='polynomial', order=2)
0     1.0
1     4.0
2     9.0
3    16.0
Name: d, dtype: float64
where(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame[source]
where(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
where(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame | None

Replace values where the condition is False.

Parameters:
  • cond (bool Series/DataFrame, array-like, or callable) – Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

  • other (scalar, Series/DataFrame, or callable) – Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.

  • level (int, default None) – Alignment level if needed.

Return type:

Same type as caller or None if inplace=True.

See also

DataFrame.mask()

Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with False.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
mask(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame[source]
mask(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) None
mask(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) DataFrame | None

Replace values where the condition is True.

Parameters:
  • cond (bool Series/DataFrame, array-like, or callable) – Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

  • other (scalar, Series/DataFrame, or callable) – Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.

  • level (int, default None) – Alignment level if needed.

Return type:

Same type as caller or None if inplace=True.

See also

DataFrame.where()

Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with True.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object’s dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
class pandas.DateOffset

Standard kind of date increment used for a date range.

Works exactly like the keyword argument form of relativedelta. Note that the positional argument form of relativedelta is not supported. Use of the keyword n is discouraged; you would be better off specifying n in the keywords you use, but regardless it is there for you. n is needed for DateOffset subclasses.

DateOffset works as follows. Each offset specifies a set of dates that conform to the DateOffset. For example, Bday defines this set to be the set of dates that are weekdays (M-F). To test if a date is in the set of a DateOffset dateOffset we can use the is_on_offset method: dateOffset.is_on_offset(date).

If a date is not itself a valid date for the offset, the rollback and rollforward methods can be used to roll it to the nearest valid date before or after it.

DateOffsets can be created to move dates forward a given number of valid dates. For example, Bday(2) can be added to a date to move it two business days forward. If the date does not start on a valid date, first it is moved to a valid date. Thus pseudo code is:

def __add__(date):
  date = rollback(date) # does nothing if date is valid
  return date + <n number of periods>

When a date offset is created for a negative number of periods, the date is first rolled forward. The pseudo code is:

def __add__(date):
  date = rollforward(date) # does nothing if date is valid
  return date + <n number of periods>
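
As a concrete illustration (dates chosen for demonstration): a Saturday is not on the Bday offset, so Bday(2) first rolls it to a valid date and then moves two business days forward:

>>> from pandas.tseries.offsets import BDay
>>> sat = pd.Timestamp('2017-01-07')  # a Saturday
>>> BDay().is_on_offset(sat)
False
>>> sat + BDay(2)
Timestamp('2017-01-10 00:00:00')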

Zero presents a problem. Should it roll forward or back? We arbitrarily have it rollforward:

date + BDay(0) == BDay.rollforward(date)

Since 0 is a bit weird, we suggest avoiding its use.

In addition, adding a DateOffset specified by the singular form of a date component (e.g. day, hour) can be used to replace that component of the timestamp.

Parameters:
  • n (int, default 1) – The number of time periods the offset represents. If specified without a temporal pattern, defaults to n days.

  • normalize (bool, default False) – Whether to round the result of a DateOffset addition down to the previous midnight.

  • **kwds

    Temporal parameters that add to or replace the offset value.

    Parameters that add to the offset (like Timedelta):

    • years

    • months

    • weeks

    • days

    • hours

    • minutes

    • seconds

    • milliseconds

    • microseconds

    • nanoseconds

    Parameters that replace the offset value:

    • year

    • month

    • day

    • weekday

    • hour

    • minute

    • second

    • microsecond

    • nanosecond.

See also

dateutil.relativedelta.relativedelta

The relativedelta type is designed to be applied to an existing datetime and can replace specific components of that datetime, or represent an interval of time.

Examples

>>> from pandas.tseries.offsets import DateOffset
>>> ts = pd.Timestamp('2017-01-01 09:10:11')
>>> ts + DateOffset(months=3)
Timestamp('2017-04-01 09:10:11')
>>> ts = pd.Timestamp('2017-01-01 09:10:11')
>>> ts + DateOffset(months=2)
Timestamp('2017-03-01 09:10:11')
>>> ts + DateOffset(day=31)
Timestamp('2017-01-31 09:10:11')
>>> ts + pd.DateOffset(hour=8)
Timestamp('2017-01-01 08:10:11')
class pandas.DatetimeIndex[source]

Immutable ndarray-like of datetime64 data.

Represented internally as int64; the values can be boxed to Timestamp objects, which are subclasses of datetime and carry metadata.

Changed in version 2.0.0: The various numeric date/time attributes (day, month, year etc.) now have dtype int32. Previously they had dtype int64.

Parameters:
  • data (array-like (1-dimensional)) – Datetime-like data to construct index with.

  • freq (str or pandas offset object, optional) – One of pandas date offset strings or corresponding objects. The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon creation.

  • tz (pytz.timezone or dateutil.tz.tzfile or datetime.tzinfo or str) – Set the timezone of the data.

  • normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.

  • closed ({'left', 'right'}, optional) – Set whether to include start and end that are on the boundary. The default includes boundary points on either end.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.

    • ‘infer’ will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)

    • ‘NaT’ will return NaT where there are ambiguous times

    • ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.

  • dayfirst (bool, default False) – If True, parse dates in data with the day first order.

  • yearfirst (bool, default False) – If True, parse dates in data with the year first order.

  • dtype (numpy.dtype or DatetimeTZDtype or str, default None) – Note that the only NumPy dtype allowed is ‘datetime64[ns]’.

  • copy (bool, default False) – Make a copy of input ndarray.

  • name (label, default None) – Name to be stored in the index.

Return type:

DatetimeIndex

year
month
day
hour
minute
second
microsecond
nanosecond
date
time
timetz
dayofyear
day_of_year
weekofyear
week
dayofweek
day_of_week
weekday
quarter
tz
Type:

dt.tzinfo | None

freq
freqstr
is_month_start
is_month_end
is_quarter_start
is_quarter_end
is_year_start
is_year_end
is_leap_year
inferred_freq
normalize()
strftime()[source]
Return type:

Index

snap()[source]
Parameters:

freq (Frequency) –

Return type:

DatetimeIndex

tz_convert()[source]
Return type:

DatetimeIndex

tz_localize()[source]
Parameters:
  • ambiguous (TimeAmbiguous) –

  • nonexistent (TimeNonexistent) –

Return type:

DatetimeIndex

round()
floor()
ceil()
to_period()
to_pydatetime()
to_series()
to_frame()
month_name()
day_name()
mean()
std()

See also

Index

The base pandas Index type.

TimedeltaIndex

Index of timedelta64 data.

PeriodIndex

Index of Period data.

to_datetime

Convert argument to datetime.

date_range

Create a fixed-frequency DatetimeIndex.

Notes

To learn more about the frequency strings, please see this link.

property tz

Return the timezone.

Returns:

Returns None when the array is tz-naive.

Return type:

datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None
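
Examples

A minimal sketch (assuming pytz provides the timezone object, which is the pandas default for 'UTC'):

>>> idx = pd.date_range('2021-01-01', periods=2, tz='UTC')
>>> idx.tz
<UTC>
>>> pd.date_range('2021-01-01', periods=2).tz is None
True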

strftime(date_format)[source]

Convert to Index using specified date_format.

Return an Index of formatted strings specified by date_format, which supports the same string format as the python standard library. Details of the string format can be found in python string format doc.

Formats supported by the C strftime API but not by the python string format doc (such as “%R”, “%r”) are not officially supported and should preferably be replaced with their supported equivalents (such as “%H:%M”, “%I:%M:%S %p”).

Note that PeriodIndex supports additional directives, detailed in Period.strftime.

Parameters:

date_format (str) – Date format string (e.g. “%Y-%m-%d”).

Returns:

NumPy ndarray of formatted strings.

Return type:

ndarray[object]

See also

to_datetime

Convert the given argument to datetime.

DatetimeIndex.normalize

Return DatetimeIndex with times to midnight.

DatetimeIndex.round

Round the DatetimeIndex to the specified freq.

DatetimeIndex.floor

Floor the DatetimeIndex to the specified freq.

Timestamp.strftime

Format a single Timestamp.

Period.strftime

Format a single Period.

Examples

>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
...                     periods=3, freq='s')
>>> rng.strftime('%B %d, %Y, %r')
Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
       'March 10, 2018, 09:00:02 AM'],
      dtype='object')
tz_convert(tz)[source]

Convert tz-aware Datetime Array/Index from one time zone to another.

Parameters:

tz (str, pytz.timezone, dateutil.tz.tzfile, datetime.tzinfo or None) – Time zone to convert the timestamps of the Datetime Array/Index to. A tz of None will convert to UTC and remove the timezone information.

Return type:

Array or Index

Raises:

TypeError – If Datetime Array/Index is tz-naive.

See also

DatetimeIndex.tz

A timezone that has a variable offset from UTC.

DatetimeIndex.tz_localize

Localize tz-naive DatetimeIndex to a given time zone, or remove timezone from a tz-aware DatetimeIndex.

Examples

With the tz parameter, we can change the DatetimeIndex to other time zones:

>>> dti = pd.date_range(start='2014-08-01 09:00',
...                     freq='H', periods=3, tz='Europe/Berlin')
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
               '2014-08-01 10:00:00+02:00',
               '2014-08-01 11:00:00+02:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central')
DatetimeIndex(['2014-08-01 02:00:00-05:00',
               '2014-08-01 03:00:00-05:00',
               '2014-08-01 04:00:00-05:00'],
              dtype='datetime64[ns, US/Central]', freq='H')

With tz=None, we can remove the timezone (after converting to UTC if necessary):

>>> dti = pd.date_range(start='2014-08-01 09:00', freq='H',
...                     periods=3, tz='Europe/Berlin')
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
               '2014-08-01 10:00:00+02:00',
               '2014-08-01 11:00:00+02:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None)
DatetimeIndex(['2014-08-01 07:00:00',
               '2014-08-01 08:00:00',
               '2014-08-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')
tz_localize(tz, ambiguous='raise', nonexistent='raise')[source]

Localize tz-naive Datetime Array/Index to tz-aware Datetime Array/Index.

This method takes a time zone (tz) naive Datetime Array/Index object and makes this time zone aware. It does not move the time to another time zone.

This method can also be used to do the inverse – to create a time zone unaware object from an aware object. To that end, pass tz=None.

Parameters:
  • tz (str, pytz.timezone, dateutil.tz.tzfile, datetime.tzinfo or None) – Time zone to convert timestamps to. Passing None will remove the time zone information preserving local time.

  • ambiguous ('infer', 'NaT', bool array, default 'raise') –

    When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.

    • ‘infer’ will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False signifies a non-DST time (note that this flag is only applicable for ambiguous times)

    • ‘NaT’ will return NaT where there are ambiguous times

    • ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • ‘shift_forward’ will shift the nonexistent time forward to the closest existing time

    • ‘shift_backward’ will shift the nonexistent time backward to the closest existing time

    • ‘NaT’ will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Array/Index converted to the specified time zone.

Return type:

Same type as self

Raises:

TypeError – If the Datetime Array/Index is tz-aware and tz is not None.

See also

DatetimeIndex.tz_convert

Convert tz-aware DatetimeIndex from one time zone to another.

Examples

>>> tz_naive = pd.date_range('2018-03-01 09:00', periods=3)
>>> tz_naive
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
               '2018-03-03 09:00:00'],
              dtype='datetime64[ns]', freq='D')

Localize DatetimeIndex in US/Eastern time zone:

>>> tz_aware = tz_naive.tz_localize(tz='US/Eastern')
>>> tz_aware
DatetimeIndex(['2018-03-01 09:00:00-05:00',
               '2018-03-02 09:00:00-05:00',
               '2018-03-03 09:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq=None)

With tz=None, we can remove the time zone information while keeping the local time (not converted to UTC):

>>> tz_aware.tz_localize(None)
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
               '2018-03-03 09:00:00'],
              dtype='datetime64[ns]', freq=None)

Be careful with DST changes. When there is sequential data, pandas can infer the DST time:

>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 02:00:00',
...                               '2018-10-28 02:30:00',
...                               '2018-10-28 03:00:00',
...                               '2018-10-28 03:30:00']))
>>> s.dt.tz_localize('CET', ambiguous='infer')
0   2018-10-28 01:30:00+02:00
1   2018-10-28 02:00:00+02:00
2   2018-10-28 02:30:00+02:00
3   2018-10-28 02:00:00+01:00
4   2018-10-28 02:30:00+01:00
5   2018-10-28 03:00:00+01:00
6   2018-10-28 03:30:00+01:00
dtype: datetime64[ns, CET]

In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly.

>>> s = pd.to_datetime(pd.Series(['2018-10-28 01:20:00',
...                               '2018-10-28 02:36:00',
...                               '2018-10-28 03:46:00']))
>>> s.dt.tz_localize('CET', ambiguous=np.array([True, True, False]))
0   2018-10-28 01:20:00+02:00
1   2018-10-28 02:36:00+02:00
2   2018-10-28 03:46:00+01:00
dtype: datetime64[ns, CET]

If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or ‘shift_forward’ or ‘shift_backward’.

>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
...                               '2015-03-29 03:30:00']))
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
0   2015-03-29 03:00:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, Europe/Warsaw]
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
0   2015-03-29 01:59:59.999999999+01:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, Europe/Warsaw]
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
0   2015-03-29 03:30:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, Europe/Warsaw]
to_period(*args, **kwargs)

Cast to PeriodArray/Index at a particular frequency.

Converts DatetimeArray/Index to PeriodArray/Index.

Parameters:

freq (str or Offset, optional) – One of pandas’ offset strings or an Offset object. Will be inferred by default.

Return type:

PeriodArray/Index

Raises:

ValueError – When converting a DatetimeArray/Index with non-regular values, so that a frequency cannot be inferred.

See also

PeriodIndex

Immutable ndarray holding ordinal values.

DatetimeIndex.to_pydatetime

Return DatetimeIndex as object.

Examples

>>> df = pd.DataFrame({"y": [1, 2, 3]},
...                   index=pd.to_datetime(["2000-03-31 00:00:00",
...                                         "2000-05-31 00:00:00",
...                                         "2000-08-31 00:00:00"]))
>>> df.index.to_period("M")
PeriodIndex(['2000-03', '2000-05', '2000-08'],
            dtype='period[M]')

Infer the daily frequency

>>> idx = pd.date_range("2017-01-01", periods=2)
>>> idx.to_period()
PeriodIndex(['2017-01-01', '2017-01-02'],
            dtype='period[D]')
to_julian_date()[source]

Convert Datetime Array to float64 ndarray of Julian Dates. Julian date 0 is noon on January 1, 4713 BC. https://en.wikipedia.org/wiki/Julian_day

Return type:

Index
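
Examples

A minimal sketch: noon on 2000-01-01 (the J2000 epoch) corresponds to Julian date 2451545.0 (repr shown as in pandas 2.0):

>>> idx = pd.DatetimeIndex(['2000-01-01 12:00'])
>>> idx.to_julian_date()
Index([2451545.0], dtype='float64')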

isocalendar()[source]

Calculate year, week, and day according to the ISO 8601 standard.

New in version 1.1.0.

Returns:

With columns year, week and day.

Return type:

DataFrame

See also

Timestamp.isocalendar

Function return a 3-tuple containing ISO year, week number, and weekday for the given Timestamp object.

datetime.date.isocalendar

Return a named tuple object with three components: year, week and weekday.

Examples

>>> idx = pd.date_range(start='2019-12-29', freq='D', periods=4)
>>> idx.isocalendar()
            year  week  day
2019-12-29  2019    52    7
2019-12-30  2020     1    1
2019-12-31  2020     1    2
2020-01-01  2020     1    3
>>> idx.isocalendar().week
2019-12-29    52
2019-12-30     1
2019-12-31     1
2020-01-01     1
Freq: D, Name: week, dtype: UInt32
snap(freq='S')[source]

Snap time stamps to nearest occurring frequency.

Parameters:

freq (Frequency) –

Return type:

DatetimeIndex
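
Examples

A minimal sketch (index values chosen for illustration): snapping to daily frequency moves each timestamp to the nearest midnight.

>>> idx = pd.DatetimeIndex(['2023-01-01 01:00', '2023-01-01 23:00'])
>>> idx.snap(freq='D')
DatetimeIndex(['2023-01-01', '2023-01-02'], dtype='datetime64[ns]', freq=None)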

get_loc(key)[source]

Get integer location for requested label.

Returns:

loc

Return type:

int
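
Examples

A minimal sketch (labels chosen for illustration):

>>> idx = pd.DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03'])
>>> idx.get_loc('2021-01-02')
1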

slice_indexer(start=None, end=None, step=None)[source]

Return indexer for specified label slice. This is Index.slice_indexer, customized to handle time slicing.

In addition to the functionality provided by Index.slice_indexer, it does the following:

  • if both start and end are instances of datetime.time, it invokes indexer_between_time

  • if both start and end are either strings or None, perform value-based selection in non-monotonic cases.
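
Examples

A minimal sketch; the returned slice is positional, and the end label is included:

>>> idx = pd.date_range('2021-01-01', periods=3, freq='D')
>>> idx.slice_indexer('2021-01-01', '2021-01-02')
slice(0, 2, None)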

property inferred_type: str

Return a string of the type inferred from the values.
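
Examples

For example, for a DatetimeIndex the inferred type is 'datetime64':

>>> pd.DatetimeIndex(['2021-01-01']).inferred_type
'datetime64'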

indexer_at_time(time, asof=False)[source]

Return index locations of values at particular time of day.

Parameters:
  • time (datetime.time or str) – Time passed in either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).

  • asof (bool) –

Return type:

np.ndarray[np.intp]

See also

indexer_between_time

Get index locations of values between particular times of day.

DataFrame.at_time

Select values at particular time of day.
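
Examples

A minimal sketch (timestamps chosen for illustration):

>>> idx = pd.DatetimeIndex(['2023-01-01 09:00', '2023-01-01 10:00',
...                         '2023-01-02 09:00'])
>>> idx.indexer_at_time('09:00')
array([0, 2])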

indexer_between_time(start_time, end_time, include_start=True, include_end=True)[source]

Return index locations of values between particular times of day.

Parameters:
  • start_time (datetime.time, str) – Time passed either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).

  • end_time (datetime.time, str) – Time passed either as object (datetime.time) or as string in appropriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”, “%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).

  • include_start (bool, default True) –

  • include_end (bool, default True) –

Return type:

np.ndarray[np.intp]

See also

indexer_at_time

Get index locations of values at particular time of day.

DataFrame.between_time

Select values between particular times of day.
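
Examples

A minimal sketch (timestamps chosen for illustration); both endpoints are included by default:

>>> idx = pd.DatetimeIndex(['2023-01-01 09:15', '2023-01-01 11:00',
...                         '2023-01-02 09:45'])
>>> idx.indexer_between_time('09:00', '10:00')
array([0, 2])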

as_unit(*args, **kwargs)

Convert to a dtype with the given unit resolution.

Parameters:

unit ({'s', 'ms', 'us', 'ns'}) –

Return type:

same type as self
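
Examples

A minimal sketch (assuming pandas 2.0+, where non-nanosecond resolutions are supported), converting nanosecond resolution to second resolution:

>>> idx = pd.DatetimeIndex(['2020-01-01'])
>>> idx.as_unit('s')
DatetimeIndex(['2020-01-01'], dtype='datetime64[s]', freq=None)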

ceil(*args, **kwargs)

Perform ceil operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to ceil the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • ‘infer’ will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • ‘NaT’ will return NaT where there are ambiguous times

    • ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • ‘shift_forward’ will shift the nonexistent time forward to the closest existing time

    • ‘shift_backward’ will shift the nonexistent time backward to the closest existing time

    • ‘NaT’ will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError – If the freq cannot be converted.

Notes

If the timestamps have a timezone, ceiling will take place relative to the local (“wall”) time and re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.ceil('H')
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 13:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.ceil("H")
0   2018-01-01 12:00:00
1   2018-01-01 12:00:00
2   2018-01-01 13:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 01:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.ceil("H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.ceil("H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
property date

Returns numpy array of python datetime.date objects.

Namely, the date part of Timestamps without time and timezone information.

property day

The day of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="D")
... )
>>> datetime_series
0   2000-01-01
1   2000-01-02
2   2000-01-03
dtype: datetime64[ns]
>>> datetime_series.dt.day
0    1
1    2
2    3
dtype: int32
day_name(*args, **kwargs)

Return the day names with specified locale.

Parameters:

locale (str, optional) – Locale determining the language in which to return the day name. Default is English locale ('en_US.utf8'). Use the command locale -a on your terminal on Unix systems to find your locale language code.

Returns:

Series or Index of day names.

Return type:

Series or Index

Examples

>>> s = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=3))
>>> s
0   2018-01-01
1   2018-01-02
2   2018-01-03
dtype: datetime64[ns]
>>> s.dt.day_name()
0       Monday
1      Tuesday
2    Wednesday
dtype: object
>>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3)
>>> idx
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
              dtype='datetime64[ns]', freq='D')
>>> idx.day_name()
Index(['Monday', 'Tuesday', 'Wednesday'], dtype='object')

Using the locale parameter you can set a different locale language, for example: idx.day_name(locale='pt_BR.utf8') will return day names in Brazilian Portuguese language.

>>> idx = pd.date_range(start='2018-01-01', freq='D', periods=3)
>>> idx
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03'],
              dtype='datetime64[ns]', freq='D')
>>> idx.day_name(locale='pt_BR.utf8') 
Index(['Segunda', 'Terça', 'Quarta'], dtype='object')
property day_of_week

The day of the week with Monday=0, Sunday=6.

Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0 and ends on Sunday which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and DatetimeIndex.

Returns:

Containing integers indicating the day number.

Return type:

Series or Index

See also

Series.dt.dayofweek

Alias.

Series.dt.weekday

Alias.

Series.dt.day_name

Returns the name of the day of the week.

Examples

>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
>>> s.dt.dayofweek
2016-12-31    5
2017-01-01    6
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-07    5
2017-01-08    6
Freq: D, dtype: int32
property day_of_year

The ordinal day of the year.
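
Examples

A minimal sketch:

>>> idx = pd.DatetimeIndex(['2021-01-01', '2021-02-01'])
>>> idx.day_of_year
Index([1, 32], dtype='int32')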

property dayofweek

The day of the week with Monday=0, Sunday=6.

Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0 and ends on Sunday which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and DatetimeIndex.

Returns:

Containing integers indicating the day number.

Return type:

Series or Index

See also

Series.dt.dayofweek

Alias.

Series.dt.weekday

Alias.

Series.dt.day_name

Returns the name of the day of the week.

Examples

>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
>>> s.dt.dayofweek
2016-12-31    5
2017-01-01    6
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-07    5
2017-01-08    6
Freq: D, dtype: int32
property dayofyear

The ordinal day of the year.

property days_in_month

The number of days in the month.

property daysinmonth

The number of days in the month.
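
Examples

A minimal sketch (2021 is not a leap year, so February has 28 days):

>>> idx = pd.DatetimeIndex(['2021-01-15', '2021-02-15'])
>>> idx.days_in_month
Index([31, 28], dtype='int32')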

property dtype

The dtype for the DatetimeArray.

Warning

A future version of pandas will change dtype to never be a numpy.dtype. Instead, DatetimeArray.dtype will always be an instance of an ExtensionDtype subclass.

Returns:

If the values are tz-naive, then np.dtype('datetime64[ns]') is returned.

If the values are tz-aware, then the DatetimeTZDtype is returned.

Return type:

numpy.dtype or DatetimeTZDtype

floor(*args, **kwargs)

Perform floor operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • ‘infer’ will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • ‘NaT’ will return NaT where there are ambiguous times

    • ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • ‘shift_forward’ will shift the nonexistent time forward to the closest existing time

    • ‘shift_backward’ will shift the nonexistent time backward to the closest existing time

    • ‘NaT’ will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError – If the freq cannot be converted.

Notes

If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.floor('H')
DatetimeIndex(['2018-01-01 11:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.floor("H")
0   2018-01-01 11:00:00
1   2018-01-01 12:00:00
2   2018-01-01 12:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
property hour

The hours of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="h")
... )
>>> datetime_series
0   2000-01-01 00:00:00
1   2000-01-01 01:00:00
2   2000-01-01 02:00:00
dtype: datetime64[ns]
>>> datetime_series.dt.hour
0    0
1    1
2    2
dtype: int32
property is_leap_year

Boolean indicator if the date belongs to a leap year.

A leap year is a year with 366 days (instead of 365), including February 29th as an intercalary day. Leap years are years which are multiples of four, with the exception of years divisible by 100 but not by 400.

Returns:

Booleans indicating if dates belong to a leap year.

Return type:

Series or ndarray

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y")
>>> idx
DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
>>> idx.is_leap_year
array([ True, False, False])
>>> dates_series = pd.Series(idx)
>>> dates_series
0   2012-12-31
1   2013-12-31
2   2014-12-31
dtype: datetime64[ns]
>>> dates_series.dt.is_leap_year
0     True
1    False
2    False
dtype: bool
property is_month_end

Indicates whether the date is the last day of the month.

Returns:

For Series, returns a Series with boolean values. For DatetimeIndex, returns a boolean array.

Return type:

Series or array

See also

is_month_start

Return a boolean indicating whether the date is the first day of the month.

is_month_end

Return a boolean indicating whether the date is the last day of the month.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s
0   2018-02-27
1   2018-02-28
2   2018-03-01
dtype: datetime64[ns]
>>> s.dt.is_month_start
0    False
1    False
2    True
dtype: bool
>>> s.dt.is_month_end
0    False
1    True
2    False
dtype: bool
>>> idx = pd.date_range("2018-02-27", periods=3)
>>> idx.is_month_start
array([False, False, True])
>>> idx.is_month_end
array([False, True, False])
property is_month_start

Indicates whether the date is the first day of the month.

Returns:

For Series, returns a Series with boolean values. For DatetimeIndex, returns a boolean array.

Return type:

Series or array

See also

is_month_start

Return a boolean indicating whether the date is the first day of the month.

is_month_end

Return a boolean indicating whether the date is the last day of the month.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s
0   2018-02-27
1   2018-02-28
2   2018-03-01
dtype: datetime64[ns]
>>> s.dt.is_month_start
0    False
1    False
2    True
dtype: bool
>>> s.dt.is_month_end
0    False
1    True
2    False
dtype: bool
>>> idx = pd.date_range("2018-02-27", periods=3)
>>> idx.is_month_start
array([False, False, True])
>>> idx.is_month_end
array([False, True, False])
is_normalized

Returns True if all of the dates are at midnight (“no time”).
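
Examples

A minimal sketch:

>>> pd.DatetimeIndex(['2014-01-01', '2014-01-02']).is_normalized
True
>>> pd.DatetimeIndex(['2014-01-01 06:00']).is_normalized
False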

property is_quarter_end

Indicator for whether the date is the last day of a quarter.

Returns:

is_quarter_end – The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.

Return type:

Series or DatetimeIndex

See also

quarter

Return the quarter of the date.

is_quarter_start

Similar property indicating the quarter start.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
...                    periods=4)})
>>> df.assign(quarter=df.dates.dt.quarter,
...           is_quarter_end=df.dates.dt.is_quarter_end)
       dates  quarter    is_quarter_end
0 2017-03-30        1             False
1 2017-03-31        1              True
2 2017-04-01        2             False
3 2017-04-02        2             False
>>> idx = pd.date_range('2017-03-30', periods=4)
>>> idx
DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
              dtype='datetime64[ns]', freq='D')
>>> idx.is_quarter_end
array([False,  True, False, False])
property is_quarter_start

Indicator for whether the date is the first day of a quarter.

Returns:

is_quarter_start – The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.

Return type:

Series or DatetimeIndex

See also

quarter

Return the quarter of the date.

is_quarter_end

Similar property for indicating the quarter end.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> df = pd.DataFrame({'dates': pd.date_range("2017-03-30",
...                   periods=4)})
>>> df.assign(quarter=df.dates.dt.quarter,
...           is_quarter_start=df.dates.dt.is_quarter_start)
       dates  quarter  is_quarter_start
0 2017-03-30        1             False
1 2017-03-31        1             False
2 2017-04-01        2              True
3 2017-04-02        2             False
>>> idx = pd.date_range('2017-03-30', periods=4)
>>> idx
DatetimeIndex(['2017-03-30', '2017-03-31', '2017-04-01', '2017-04-02'],
              dtype='datetime64[ns]', freq='D')
>>> idx.is_quarter_start
array([False, False,  True, False])
property is_year_end

Indicate whether the date is the last day of the year.

Returns:

The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.

Return type:

Series or DatetimeIndex

See also

is_year_start

Similar property indicating the start of the year.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
>>> dates
0   2017-12-30
1   2017-12-31
2   2018-01-01
dtype: datetime64[ns]
>>> dates.dt.is_year_end
0    False
1     True
2    False
dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3)
>>> idx
DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
              dtype='datetime64[ns]', freq='D')
>>> idx.is_year_end
array([False,  True, False])
property is_year_start

Indicate whether the date is the first day of a year.

Returns:

The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.

Return type:

Series or DatetimeIndex

See also

is_year_end

Similar property indicating the last day of the year.

Examples

This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.

>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))
>>> dates
0   2017-12-30
1   2017-12-31
2   2018-01-01
dtype: datetime64[ns]
>>> dates.dt.is_year_start
0    False
1    False
2    True
dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3)
>>> idx
DatetimeIndex(['2017-12-30', '2017-12-31', '2018-01-01'],
              dtype='datetime64[ns]', freq='D')
>>> idx.is_year_start
array([False, False,  True])
property microsecond

The microseconds of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="us")
... )
>>> datetime_series
0   2000-01-01 00:00:00.000000
1   2000-01-01 00:00:00.000001
2   2000-01-01 00:00:00.000002
dtype: datetime64[ns]
>>> datetime_series.dt.microsecond
0       0
1       1
2       2
dtype: int32
property minute

The minutes of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="T")
... )
>>> datetime_series
0   2000-01-01 00:00:00
1   2000-01-01 00:01:00
2   2000-01-01 00:02:00
dtype: datetime64[ns]
>>> datetime_series.dt.minute
0    0
1    1
2    2
dtype: int32
property month

The month as January=1, December=12.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="M")
... )
>>> datetime_series
0   2000-01-31
1   2000-02-29
2   2000-03-31
dtype: datetime64[ns]
>>> datetime_series.dt.month
0    1
1    2
2    3
dtype: int32
month_name(*args, **kwargs)

Return the month names with specified locale.

Parameters:

locale (str, optional) – Locale determining the language in which to return the month name. Default is English locale ('en_US.utf8'). Use the command locale -a on your terminal on Unix systems to find your locale language code.

Returns:

Series or Index of month names.

Return type:

Series or Index

Examples

>>> s = pd.Series(pd.date_range(start='2018-01', freq='M', periods=3))
>>> s
0   2018-01-31
1   2018-02-28
2   2018-03-31
dtype: datetime64[ns]
>>> s.dt.month_name()
0     January
1    February
2       March
dtype: object
>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
              dtype='datetime64[ns]', freq='M')
>>> idx.month_name()
Index(['January', 'February', 'March'], dtype='object')

Using the locale parameter you can set a different locale language, for example: idx.month_name(locale='pt_BR.utf8') will return month names in Brazilian Portuguese language.

>>> idx = pd.date_range(start='2018-01', freq='M', periods=3)
>>> idx
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
              dtype='datetime64[ns]', freq='M')
>>> idx.month_name(locale='pt_BR.utf8') 
Index(['Janeiro', 'Fevereiro', 'Março'], dtype='object')
property nanosecond

The nanoseconds of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="ns")
... )
>>> datetime_series
0   2000-01-01 00:00:00.000000000
1   2000-01-01 00:00:00.000000001
2   2000-01-01 00:00:00.000000002
dtype: datetime64[ns]
>>> datetime_series.dt.nanosecond
0       0
1       1
2       2
dtype: int32
normalize(*args, **kwargs)

Convert times to midnight.

The time component of the date-time is converted to midnight, i.e. 00:00:00. This is useful in cases when the time does not matter. Length is unaltered. The timezones are unaffected.

This method is available on Series with datetime values under the .dt accessor, and directly on Datetime Array/Index.

Returns:

The same type as the original data. Series will have the same name and index. DatetimeIndex will have the same name.

Return type:

DatetimeArray, DatetimeIndex or Series

See also

floor

Floor the datetimes to the specified freq.

ceil

Ceil the datetimes to the specified freq.

round

Round the datetimes to the specified freq.

Examples

>>> idx = pd.date_range(start='2014-08-01 10:00', freq='H',
...                     periods=3, tz='Asia/Calcutta')
>>> idx
DatetimeIndex(['2014-08-01 10:00:00+05:30',
               '2014-08-01 11:00:00+05:30',
               '2014-08-01 12:00:00+05:30'],
                dtype='datetime64[ns, Asia/Calcutta]', freq='H')
>>> idx.normalize()
DatetimeIndex(['2014-08-01 00:00:00+05:30',
               '2014-08-01 00:00:00+05:30',
               '2014-08-01 00:00:00+05:30'],
               dtype='datetime64[ns, Asia/Calcutta]', freq=None)
property quarter

The quarter of the date.
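
Examples

For illustration, a small doctest in the style of the other dt accessors (quarters follow the calendar year):

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="Q")
... )
>>> datetime_series
0   2000-03-31
1   2000-06-30
2   2000-09-30
dtype: datetime64[ns]
>>> datetime_series.dt.quarter
0    1
1    2
2    3
dtype: int32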

round(*args, **kwargs)

Perform round operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to round the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • ’infer’ will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • ’NaT’ will return NaT where there are ambiguous times

    • ’raise’ will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • ’shift_forward’ will shift the nonexistent time forward to the closest existing time

    • ’shift_backward’ will shift the nonexistent time backward to the closest existing time

    • ’NaT’ will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • ’raise’ will raise an NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError if the freq cannot be converted.

Notes

If the timestamps have a timezone, rounding will take place relative to the local (“wall”) time and re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.round('H')
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.round("H")
0   2018-01-01 12:00:00
1   2018-01-01 12:00:00
2   2018-01-01 12:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
property second

The seconds of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="s")
... )
>>> datetime_series
0   2000-01-01 00:00:00
1   2000-01-01 00:00:01
2   2000-01-01 00:00:02
dtype: datetime64[ns]
>>> datetime_series.dt.second
0    0
1    1
2    2
dtype: int32
std(*args, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis (int optional, default None) – Axis for the function to be applied on. For Series this parameter is unused and defaults to None.

  • ddof (int, default 1) – Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

Return type:

Timedelta
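
Examples

A minimal sketch of the computation; for three consecutive daily timestamps the sample standard deviation is one day:

>>> idx = pd.date_range('2001-01-01 00:00', periods=3)
>>> idx
DatetimeIndex(['2001-01-01', '2001-01-02', '2001-01-03'],
              dtype='datetime64[ns]', freq='D')
>>> idx.std()
Timedelta('1 days 00:00:00')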

property time

Returns numpy array of datetime.time objects.

The time part of the Timestamps.
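
Examples

A brief sketch mirroring the other dt examples; each element is a plain datetime.time:

>>> s = pd.Series(pd.to_datetime(["2020-01-01 10:00:00", "2020-01-01 11:30:00"]))
>>> s.dt.time
0    10:00:00
1    11:30:00
dtype: object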

property timetz

Returns numpy array of datetime.time objects with timezones.

The time part of the Timestamps.

to_pydatetime(*args, **kwargs)

Return an ndarray of datetime.datetime objects.

Return type:

numpy.ndarray
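
Examples

For illustration, converting nanosecond-precision Timestamps to stdlib datetime objects:

>>> s = pd.Series(pd.date_range("2018-03-10", periods=2))
>>> s.dt.to_pydatetime()
array([datetime.datetime(2018, 3, 10, 0, 0),
       datetime.datetime(2018, 3, 11, 0, 0)], dtype=object)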

property tzinfo

Alias for the tz attribute.

property weekday

The day of the week with Monday=0, Sunday=6.

Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This property is available both on Series with datetime values (under the dt accessor) and on DatetimeIndex.

Returns:

Containing integers indicating the day number.

Return type:

Series or Index

See also

Series.dt.dayofweek

Alias.

Series.dt.weekday

Alias.

Series.dt.day_name

Returns the name of the day of the week.

Examples

>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
>>> s.dt.dayofweek
2016-12-31    5
2017-01-01    6
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-07    5
2017-01-08    6
Freq: D, dtype: int32
property year

The year of the datetime.

Examples

>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="Y")
... )
>>> datetime_series
0   2000-12-31
1   2001-12-31
2   2002-12-31
dtype: datetime64[ns]
>>> datetime_series.dt.year
0    2000
1    2001
2    2002
dtype: int32
class pandas.DatetimeTZDtype[source]

An ExtensionDtype for timezone-aware datetime data.

This is not an actual numpy dtype, but a duck type.

Parameters:
  • unit (str, default "ns") – The precision of the datetime data. Currently limited to "ns".

  • tz (str, int, or datetime.tzinfo) – The timezone.

Raises:

pytz.UnknownTimeZoneError – When the requested timezone cannot be found.

Parameters:

unit (str_type | DatetimeTZDtype) –

Examples

>>> pd.DatetimeTZDtype(tz='UTC')
datetime64[ns, UTC]
>>> pd.DatetimeTZDtype(tz='dateutil/US/Central')
datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
type

alias of Timestamp

kind: str = 'M'
num = 101
base: dtype | ExtensionDtype | None = dtype('<M8[ns]')
property na_value: NaTType

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

str: str
property unit: str

The precision of the datetime data.

property tz: tzinfo

The timezone.
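
Examples

A short sketch of reading the parameters back off the dtype:

>>> dtype = pd.DatetimeTZDtype(tz='dateutil/US/Central')
>>> dtype.unit
'ns'
>>> dtype.tz
tzfile('/usr/share/zoneinfo/US/Central')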

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

classmethod construct_from_string(string)[source]

Construct a DatetimeTZDtype from a string.

Parameters:

string (str) – The string alias for this DatetimeTZDtype. Should be formatted like datetime64[ns, <tz>], where <tz> is the timezone name.

Return type:

DatetimeTZDtype

Examples

>>> DatetimeTZDtype.construct_from_string('datetime64[ns, UTC]')
datetime64[ns, UTC]
property name: str

A string representation of the dtype.

class pandas.ExcelFile[source]

Class for parsing tabular Excel sheets into DataFrame objects.

See read_excel for more documentation.

Parameters:
  • path_or_buffer (str, bytes, or path object (pathlib.Path or py._path.local.LocalPath)) – A file-like object, xlrd workbook or openpyxl workbook. If a string or path object, expected to be a path to a .xls, .xlsx, .xlsb, .xlsm, .odf, .ods, or .odt file.

  • engine (str, default None) –

    If io is not a buffer or path, this must be set to identify io. Supported engines: xlrd, openpyxl, odf, pyxlsb. Engine compatibility:

    • xlrd supports old-style Excel files (.xls).

    • openpyxl supports newer Excel file formats.

    • odf supports OpenDocument file formats (.odf, .ods, .odt).

    • pyxlsb supports Binary Excel files.

    Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:

    • If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.

    • Otherwise if path_or_buffer is an xls format, xlrd will be used.

    • Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used.

    New in version 1.3.0.

    • Otherwise if openpyxl is installed, then openpyxl will be used.

    • Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.

    Warning

    Please do not report issues when using xlrd to read .xlsx files. This is not supported, switch to using openpyxl instead.

  • storage_options (StorageOptions) –

class ODFReader
Parameters:
  • filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

  • storage_options (StorageOptions) –

property empty_value: str

Property for compat with other readers.

get_sheet_by_index(index)
Parameters:

index (int) –

get_sheet_by_name(name)
Parameters:

name (str) –

get_sheet_data(sheet, file_rows_needed=None)

Parse an ODF Table into a list of lists

Parameters:

file_rows_needed (int | None) –

Return type:

list[list[Scalar | NaTType]]

load_workbook(filepath_or_buffer)
Parameters:

filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

property sheet_names: list[str]

Return a list of sheet names present in the document

class OpenpyxlReader
Parameters:
  • filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

  • storage_options (StorageOptions) –

get_sheet_by_index(index)
Parameters:

index (int) –

get_sheet_by_name(name)
Parameters:

name (str) –

get_sheet_data(sheet, file_rows_needed=None)
Parameters:

file_rows_needed (int | None) –

Return type:

list[list[Scalar]]

load_workbook(filepath_or_buffer)
Parameters:

filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

property sheet_names: list[str]
class PyxlsbReader
Parameters:
  • filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

  • storage_options (StorageOptions) –

get_sheet_by_index(index)
Parameters:

index (int) –

get_sheet_by_name(name)
Parameters:

name (str) –

get_sheet_data(sheet, file_rows_needed=None)
Parameters:

file_rows_needed (int | None) –

Return type:

list[list[Scalar]]

load_workbook(filepath_or_buffer)
Parameters:

filepath_or_buffer (FilePath | ReadBuffer[bytes]) –

property sheet_names: list[str]
class XlrdReader
Parameters:

storage_options (StorageOptions) –

get_sheet_by_index(index)
get_sheet_by_name(name)
get_sheet_data(sheet, file_rows_needed=None)
Parameters:

file_rows_needed (int | None) –

Return type:

list[list[Scalar]]

load_workbook(filepath_or_buffer)
property sheet_names
parse(sheet_name=0, header=0, names=None, index_col=None, usecols=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, parse_dates=False, date_parser=_NoDefault.no_default, date_format=None, thousands=None, comment=None, skipfooter=0, dtype_backend=_NoDefault.no_default, **kwds)[source]

Parse specified sheet(s) into a DataFrame.

Equivalent to read_excel(ExcelFile, …) See the read_excel docstring for more info on accepted parameters.

Returns:

DataFrame from the passed in Excel file.

Return type:

DataFrame or dict of DataFrames
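
Examples

For illustration, typical usage with a hypothetical file name (so the doctest is skipped, as elsewhere in this document):

>>> file = pd.ExcelFile("path_to_file.xlsx")  
>>> df = file.parse(sheet_name="Sheet1")  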

property book
property sheet_names
close()[source]

close io if necessary

Return type:

None

class pandas.ExcelWriter[source]

Class for writing DataFrame objects into excel sheets.

Default is to use:

  • xlsxwriter for xlsx files if xlsxwriter is installed, otherwise openpyxl

  • odswriter for ods files

See DataFrame.to_excel for typical usage.

The writer should be used as a context manager. Otherwise, call close() to save and close any opened file handles.

Parameters:
  • path (str or BinaryIO) – Path to xls or xlsx or ods file.

  • engine (str (optional)) – Engine to use for writing. If None, defaults to io.excel.<extension>.writer. NOTE: can only be passed as a keyword argument.

  • date_format (str, default None) – Format string for dates written into Excel files (e.g. ‘YYYY-MM-DD’).

  • datetime_format (str, default None) – Format string for datetime objects written into Excel files. (e.g. ‘YYYY-MM-DD HH:MM:SS’).

  • mode ({'w', 'a'}, default 'w') – File mode to use (write or append). Append does not work with fsspec URLs.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • if_sheet_exists ({'error', 'new', 'replace', 'overlay'}, default 'error') –

    How to behave when trying to write to a sheet that already exists (append mode only).

    • error: raise a ValueError.

    • new: Create a new sheet, with a name determined by the engine.

    • replace: Delete the contents of the sheet before writing to it.

    • overlay: Write contents to the existing sheet without removing the old contents.

    New in version 1.3.0.

    Changed in version 1.4.0: Added overlay option

  • engine_kwargs (dict, optional) –

    Keyword arguments to be passed into the engine. These will be passed to the following functions of the respective engines:

    • xlsxwriter: xlsxwriter.Workbook(file, **engine_kwargs)

    • openpyxl (write mode): openpyxl.Workbook(**engine_kwargs)

    • openpyxl (append mode): openpyxl.load_workbook(file, **engine_kwargs)

    • odswriter: odf.opendocument.OpenDocumentSpreadsheet(**engine_kwargs)

    New in version 1.3.0.

Return type:

ExcelWriter

Notes

For compatibility with CSV writers, ExcelWriter serializes lists and dicts to strings before writing.

Examples

Default usage:

>>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])  
>>> with pd.ExcelWriter("path_to_file.xlsx") as writer:
...     df.to_excel(writer)  

To write to separate sheets in a single file:

>>> df1 = pd.DataFrame([["AAA", "BBB"]], columns=["Spam", "Egg"])  
>>> df2 = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])  
>>> with pd.ExcelWriter("path_to_file.xlsx") as writer:
...     df1.to_excel(writer, sheet_name="Sheet1")  
...     df2.to_excel(writer, sheet_name="Sheet2")  

You can set the date format or datetime format:

>>> from datetime import date, datetime  
>>> df = pd.DataFrame(
...     [
...         [date(2014, 1, 31), date(1999, 9, 24)],
...         [datetime(1998, 5, 26, 23, 33, 4), datetime(2014, 2, 28, 13, 5, 13)],
...     ],
...     index=["Date", "Datetime"],
...     columns=["X", "Y"],
... )  
>>> with pd.ExcelWriter(
...     "path_to_file.xlsx",
...     date_format="YYYY-MM-DD",
...     datetime_format="YYYY-MM-DD HH:MM:SS"
... ) as writer:
...     df.to_excel(writer)  

You can also append to an existing Excel file:

>>> with pd.ExcelWriter("path_to_file.xlsx", mode="a", engine="openpyxl") as writer:
...     df.to_excel(writer, sheet_name="Sheet3")  

Here, the if_sheet_exists parameter can be set to replace a sheet if it already exists:

>>> with ExcelWriter(
...     "path_to_file.xlsx",
...     mode="a",
...     engine="openpyxl",
...     if_sheet_exists="replace",
... ) as writer:
...     df.to_excel(writer, sheet_name="Sheet1")  

You can also write multiple DataFrames to a single sheet. Note that the if_sheet_exists parameter needs to be set to overlay:

>>> with ExcelWriter("path_to_file.xlsx",
...     mode="a",
...     engine="openpyxl",
...     if_sheet_exists="overlay",
... ) as writer:
...     df1.to_excel(writer, sheet_name="Sheet1")
...     df2.to_excel(writer, sheet_name="Sheet1", startcol=3)  

You can store Excel file in RAM:

>>> import io
>>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])
>>> buffer = io.BytesIO()
>>> with pd.ExcelWriter(buffer) as writer:
...     df.to_excel(writer)

You can pack Excel file into zip archive:

>>> import zipfile  
>>> df = pd.DataFrame([["ABC", "XYZ"]], columns=["Foo", "Bar"])  
>>> with zipfile.ZipFile("path_to_file.zip", "w") as zf:
...     with zf.open("filename.xlsx", "w") as buffer:
...         with pd.ExcelWriter(buffer) as writer:
...             df.to_excel(writer)  

You can specify additional arguments to the underlying engine:

>>> with pd.ExcelWriter(
...     "path_to_file.xlsx",
...     engine="xlsxwriter",
...     engine_kwargs={"options": {"nan_inf_to_errors": True}}
... ) as writer:
...     df.to_excel(writer)  

In append mode, engine_kwargs are passed through to openpyxl’s load_workbook:

>>> with pd.ExcelWriter(
...     "path_to_file.xlsx",
...     engine="openpyxl",
...     mode="a",
...     engine_kwargs={"keep_vba": True}
... ) as writer:
...     df.to_excel(writer, sheet_name="Sheet2")  
property supported_extensions: tuple[str, ...]

Extensions that writer engine supports.

property engine: str

Name of engine.

abstract property sheets: dict[str, Any]

Mapping of sheet names to sheet objects.

abstract property book

Book instance. Class type will depend on the engine used.

This attribute can be used to access engine-specific features.

property date_format: str

Format string for dates written into Excel files (e.g. ‘YYYY-MM-DD’).

property datetime_format: str

Format string for datetimes written into Excel files (e.g. ‘YYYY-MM-DD HH:MM:SS’).

property if_sheet_exists: str

How to behave when writing to a sheet that already exists in append mode.

classmethod check_extension(ext)[source]

Check the path’s extension against the writer’s supported extensions. If it isn’t supported, raise UnsupportedFiletypeError.

Parameters:

ext (str) –

Return type:

Literal[True]

close()[source]

synonym for save, to make it more file-like

Return type:

None

class pandas.Flags[source]
property allows_duplicate_labels: bool

Whether this object allows duplicate labels.

Setting allows_duplicate_labels=False ensures that the index (and columns of a DataFrame) are unique. Most methods that accept and return a Series or DataFrame will propagate the value of allows_duplicate_labels.

See duplicates for more.

See also

DataFrame.attrs

Set global metadata on this object.

DataFrame.set_flags

Set global flags on this object.

Examples

>>> df = pd.DataFrame({"A": [1, 2]}, index=['a', 'a'])
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
Traceback (most recent call last):
    ...
pandas.errors.DuplicateLabelError: Index has duplicates.
      positions
label
a        [0, 1]
class pandas.Float32Dtype[source]

An ExtensionDtype for float32 data.

This dtype uses pd.NA as missing value indicator.

type

alias of float32

name: str = 'Float32'
class pandas.Float64Dtype[source]

An ExtensionDtype for float64 data.

This dtype uses pd.NA as missing value indicator.
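
Examples

For illustration, constructing a nullable float Series via the string alias of this dtype:

>>> pd.Series([1.5, None], dtype="Float64")
0     1.5
1    <NA>
dtype: Float64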

type

alias of float64

name: str = 'Float64'
class pandas.Grouper[source]

A Grouper allows the user to specify a groupby instruction for an object.

This specification will select a column via the key parameter, or if the level and/or axis parameters are given, a level of the index of the target object.

If axis and/or level are passed as keywords to both Grouper and groupby, the values passed to Grouper take precedence.

Parameters:
  • key (str, defaults to None) – Groupby key, which selects the grouping column of the target.

  • level (name/number, defaults to None) – The level for the target index.

  • freq (str / frequency object, defaults to None) –

    This will groupby the specified frequency if the target selection (via key or level) is a datetime-like object. For full specification of available frequencies, please see here.

  • axis (str, int, defaults to 0) – Number/name of the axis.

  • sort (bool, default to False) – Whether to sort the resulting labels.

  • closed ({'left' or 'right'}) – Closed end of interval. Only when freq parameter is passed.

  • label ({'left' or 'right'}) – Interval boundary to use for labeling. Only when freq parameter is passed.

  • convention ({'start', 'end', 'e', 's'}) – If grouper is PeriodIndex and freq parameter is passed.

  • origin (Timestamp or str, default 'start_day') –

    The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:

    • ’epoch’: origin is 1970-01-01

    • ’start’: origin is the first value of the timeseries

    • ’start_day’: origin is the first day at midnight of the timeseries

    New in version 1.1.0.

    • ’end’: origin is the last value of the timeseries

    • ’end_day’: origin is the ceiling midnight of the last day

    New in version 1.3.0.

  • offset (Timedelta or str, default is None) –

    An offset timedelta added to the origin.

    New in version 1.1.0.

  • dropna (bool, default True) –

    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

    New in version 1.2.0.

Return type:

A specification for a groupby instruction

Examples

Syntactic sugar for df.groupby('Animal'):

>>> df = pd.DataFrame(
...     {
...         "Animal": ["Falcon", "Parrot", "Falcon", "Falcon", "Parrot"],
...         "Speed": [100, 5, 200, 300, 15],
...     }
... )
>>> df
   Animal  Speed
0  Falcon    100
1  Parrot      5
2  Falcon    200
3  Falcon    300
4  Parrot     15
>>> df.groupby(pd.Grouper(key="Animal")).mean()
        Speed
Animal
Falcon  200.0
Parrot   10.0

Specify a resample operation on the column ‘Publish date’

>>> df = pd.DataFrame(
...    {
...        "Publish date": [
...             pd.Timestamp("2000-01-02"),
...             pd.Timestamp("2000-01-02"),
...             pd.Timestamp("2000-01-09"),
...             pd.Timestamp("2000-01-16")
...         ],
...         "ID": [0, 1, 2, 3],
...         "Price": [10, 20, 30, 40]
...     }
... )
>>> df
  Publish date  ID  Price
0   2000-01-02   0     10
1   2000-01-02   1     20
2   2000-01-09   2     30
3   2000-01-16   3     40
>>> df.groupby(pd.Grouper(key="Publish date", freq="1W")).mean()
               ID  Price
Publish date
2000-01-02    0.5   15.0
2000-01-09    2.0   30.0
2000-01-16    3.0   40.0

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', origin='epoch')).sum()
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', origin='2000-01-01')).sum()
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.groupby(pd.Grouper(freq='17min', origin='start')).sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.groupby(pd.Grouper(freq='17min', offset='23h30min')).sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64

To replace the use of the deprecated base argument, you can now use offset; in this example it is equivalent to base=2:

>>> ts.groupby(pd.Grouper(freq='17min', offset='2min')).sum()
2000-10-01 23:16:00     0
2000-10-01 23:33:00     9
2000-10-01 23:50:00    36
2000-10-02 00:07:00    39
2000-10-02 00:24:00    24
Freq: 17T, dtype: int64
sort: bool
dropna: bool
property ax: Index
property indexer
property obj
property grouper
property groups
class pandas.HDFStore[source]

Dict-like IO interface for storing pandas objects in PyTables.

Either Fixed or Table format.

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • path (str) – File path to HDF5 file.

  • mode ({'a', 'w', 'r', 'r+'}, default 'a') –

    'r'

    Read-only; no data can be modified.

    'w'

    Write; a new file is created (an existing file with the same name would be deleted).

    'a'

    Append; an existing file is opened for reading and writing, and if the file does not exist it is created.

    'r+'

    It is similar to 'a', but the file must already exist.

  • complevel (int, 0-9, default None) – Specifies a compression level for data. A value of 0 or None disables compression.

  • complib ({'zlib', 'lzo', 'bzip2', 'blosc'}, default 'zlib') –

    Specifies the compression library to be used. As of v0.20.2 these additional compressors for Blosc are supported (default if no compressor specified: ‘blosc:blosclz’): {‘blosc:blosclz’, ‘blosc:lz4’, ‘blosc:lz4hc’, ‘blosc:snappy’, ‘blosc:zlib’, ‘blosc:zstd’}.

    Specifying a compression library which is not available issues a ValueError.

  • fletcher32 (bool, default False) – If applying compression use the fletcher32 checksum.

  • **kwargs – These parameters will be passed to the PyTables open_file method.

Examples

>>> bar = pd.DataFrame(np.random.randn(10, 4))
>>> store = pd.HDFStore('test.h5')
>>> store['foo'] = bar   # write to HDF5
>>> bar = store['foo']   # retrieve
>>> store.close()

Create or load HDF5 file in-memory

When passing the driver option to the PyTables open_file method through **kwargs, the HDF5 file is loaded or created in-memory and will only be written when closed:

>>> bar = pd.DataFrame(np.random.randn(10, 4))
>>> store = pd.HDFStore('test.h5', driver='H5FD_CORE')
>>> store['foo'] = bar
>>> store.close()   # only now, data is written to disk
property root

return the root node

property filename: str
keys(include='pandas')[source]

Return a list of keys corresponding to objects stored in HDFStore.

Parameters:

include (str, default 'pandas') –

When include equals ‘pandas’, return pandas objects. When include equals ‘native’, return native HDF5 Table objects.

New in version 1.1.0.

Returns:

List of ABSOLUTE path-names (e.g. have the leading ‘/’).

Return type:

list

Raises:

raises ValueError if include has an illegal value
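
Examples

A minimal sketch with a hypothetical file name (doctest skipped, as for the other file-writing examples here):

>>> store = pd.HDFStore("store.h5")  
>>> store["data"] = pd.DataFrame({"A": [1, 2]})  
>>> store.keys()  
['/data']
>>> store.close()  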

items()[source]

iterate on key->group

Return type:

Iterator[tuple[str, list]]

open(mode='a', **kwargs)[source]

Open the file in the specified mode

Parameters:
  • mode ({'a', 'w', 'r', 'r+'}, default 'a') – See HDFStore docstring or tables.open_file for info about modes

  • **kwargs – These parameters will be passed to the PyTables open_file method.

Return type:

None

close()[source]

Close the PyTables file handle

Return type:

None

property is_open: bool

return a boolean indicating whether the file is open

flush(fsync=False)[source]

Force all buffered modifications to be written to disk.

Parameters:

fsync (bool (default False)) – call os.fsync() on the file handle to force writing to disk.

Return type:

None

Notes

Without fsync=True, flushing may not guarantee that the OS writes to disk. With fsync, the operation will block until the OS claims the file has been written; however, other caching layers may still interfere.

get(key)[source]

Retrieve pandas object stored in file.

Parameters:

key (str) –

Returns:

Same type as object stored in file.

Return type:

object
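
Examples

A minimal sketch with a hypothetical store:

>>> with pd.HDFStore("store.h5") as store:  
...     store["df"] = pd.DataFrame({"A": [1, 2]})  
...     df = store.get("df")  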

select(key, where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, auto_close=False)[source]

Retrieve pandas object stored in file, optionally based on where criteria.

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • key (str) – Object being retrieved from file.

  • where (list or None) – List of Term (or convertible) objects, optional.

  • start (int or None) – Row number to start selection.

  • stop (int, default None) – Row number to stop selection.

  • columns (list or None) – A list of columns that if not None, will limit the return columns.

  • iterator (bool, default False) – Return an iterator.

  • chunksize (int or None) – Number of rows to include in each iteration; returns an iterator.

  • auto_close (bool, default False) – Should automatically close the store when finished.

Returns:

Retrieved object from file.

Return type:

object
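
Examples

A sketch of a where-based query; note the object must have been written in 'table' format for where criteria to work:

>>> df = pd.DataFrame({"A": [1, 2, 3]})  
>>> with pd.HDFStore("store.h5") as store:  
...     store.put("df", df, format="table")  
...     subset = store.select("df", where="index > 1")  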

select_as_coordinates(key, where=None, start=None, stop=None)[source]

return the selection as an Index

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • key (str) –

  • where (list of Term (or convertible) objects, optional) –

  • start (integer (defaults to None), row number to start selection) –

  • stop (integer (defaults to None), row number to stop selection) –

select_column(key, column, start=None, stop=None)[source]

Return a single column from the table. This is generally only useful to select an indexable column.

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • key (str) –

  • column (str) – The column of interest.

  • start (int or None, default None) –

  • stop (int or None, default None) –

Raises:
  • raises KeyError if the column is not found (or the key is not a valid store)

  • raises ValueError if the column cannot be extracted individually (it is part of a data block)

select_as_multiple(keys, where=None, selector=None, columns=None, start=None, stop=None, iterator=False, chunksize=None, auto_close=False)[source]

Retrieve pandas objects from multiple tables.

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • keys (a list of the tables) –

  • where (list of Term (or convertible) objects, optional) –

  • selector (str) – The table to apply the where criteria to (defaults to keys[0] if not supplied).

  • columns (list or None) – The columns to return.

  • start (integer (defaults to None), row number to start selection) –

  • stop (integer (defaults to None), row number to stop selection) –

  • iterator (bool, default False) – Return an iterator.

  • chunksize (int or None) – Number of rows to include in each iteration; returns an iterator.

  • auto_close (bool, default False) – Should automatically close the store when finished.

Raises:
  • raises KeyError if keys or selector is not found or keys is empty

  • raises TypeError if keys is not a list or tuple

  • raises ValueError if the tables are not ALL THE SAME DIMENSIONS

put(key, value, format=None, index=True, append=False, complib=None, complevel=None, min_itemsize=None, nan_rep=None, data_columns=None, encoding=None, errors='strict', track_times=True, dropna=False)[source]

Store object in HDFStore.

Parameters:
  • key (str) –

  • value ({Series, DataFrame}) –

  • format ('fixed(f)|table(t)', default is 'fixed') –

    Format to use when storing object in HDFStore. Value can be one of:

    'fixed'

    Fixed format. Fast writing/reading. Not-appendable, nor searchable.

    'table'

    Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

  • index (bool, default True) – Write DataFrame index as a column.

  • append (bool, default False) – This will force Table format, append the input data to the existing.

  • data_columns (list of columns or True, default None) – List of columns to create as data columns, or True to use all columns. See here.

  • encoding (str, default None) – Provide an encoding for strings.

  • track_times (bool, default True) – Parameter is propagated to ‘create_table’ method of ‘PyTables’. If set to False it enables to have the same h5 files (same hashes) independent on creation time.

  • dropna (bool, default False, optional) –

    Remove missing values.

    New in version 1.1.0.

  • complevel (int | None) –

  • min_itemsize (int | dict[str, int] | None) –

  • errors (str) –

Return type:

None
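
Examples

For illustration, storing the same frame in the default 'fixed' format and in the queryable 'table' format (hypothetical file name):

>>> df = pd.DataFrame({"A": [1, 2]})  
>>> with pd.HDFStore("store.h5") as store:  
...     store.put("df_fixed", df)                  # fast, but not appendable/queryable
...     store.put("df_table", df, format="table")  # appendable and queryable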

remove(key, where=None, start=None, stop=None)[source]

Remove pandas object partially by specifying the where condition

Parameters:
  • key (str) – Node to remove or delete rows from

  • where (list of Term (or convertible) objects, optional) –

  • start (integer (defaults to None), row number to start selection) –

  • stop (integer (defaults to None), row number to stop selection) –

Return type:

number of rows removed (or None if not a Table)

Raises:

raises KeyError if key is not a valid store

append(key, value, format=None, axes=None, index=True, append=True, complib=None, complevel=None, columns=None, min_itemsize=None, nan_rep=None, chunksize=None, expectedrows=None, dropna=None, data_columns=None, encoding=None, errors='strict')[source]

Append to Table in file.

Node must already exist and be Table format.

Parameters:
  • key (str) –

  • value ({Series, DataFrame}) –

  • format ('table' is the default) –

    Format to use when storing object in HDFStore. Value can be one of:

    'table'

    Table format. Write as a PyTables Table structure which may perform worse but allow more flexible operations like searching / selecting subsets of the data.

  • index (bool, default True) – Write DataFrame index as a column.

  • append (bool, default True) – Append the input data to the existing.

  • data_columns (list of columns, or True, default None) – List of columns to create as indexed data columns for on-disk queries, or True to use all columns. By default only the axes of the object are indexed. See here.

  • min_itemsize (dict of columns that specify minimum str sizes) –

  • nan_rep (str to use as str nan representation) –

  • chunksize (size to chunk the writing) –

  • expectedrows (expected TOTAL row size of this table) –

  • encoding (default None, provide an encoding for str) –

  • dropna (bool, default False, optional) – Do not write an ALL nan row to the store settable by the option ‘io.hdf.dropna_table’.

  • complevel (int | None) –

  • errors (str) –

Return type:

None

Notes

Does not check if data being appended overlaps with existing data in the table, so be careful.

append_to_multiple(d, value, selector, data_columns=None, axes=None, dropna=False, **kwargs)[source]

Append to multiple tables

Parameters:
  • d (dict of table_name to table_columns) – None is acceptable as the value of one node; that table will get all the remaining columns.

  • value (a pandas object) –

  • selector (str) – Designates the indexable table; all of its columns will be designated as data_columns, unless data_columns is passed, in which case those are used.

  • data_columns (list of columns, or True) – Columns to create as data columns, or True to use all columns.

  • dropna (bool, default False) – If it evaluates to True, drop rows from all tables if any single row in each table has all NaN.

Return type:

None

Notes

axes parameter is currently not accepted

create_table_index(key, columns=None, optlevel=None, kind=None)[source]

Create a pytables index on the table.

Parameters:
  • key (str) –

  • columns (None, bool, or listlike[str]) –

    Indicate which columns to create an index on.

    • False : Do not create any indexes.

    • True : Create indexes on all columns.

    • None : Create indexes on all columns.

    • listlike : Create indexes on the given columns.

  • optlevel (int or None, default None) – Optimization level, if None, pytables defaults to 6.

  • kind (str or None, default None) – Kind of index, if None, pytables defaults to “medium”.

Raises:

TypeError – raises if the node is not a table:

Return type:

None

groups()[source]

Return a list of all the top-level nodes.

Each node returned is not a pandas storage object.

Returns:

List of objects.

Return type:

list

walk(where='/')[source]

Walk the pytables group hierarchy for pandas objects.

This generator will yield the group path, subgroups and pandas object names for each group.

Any non-pandas PyTables objects that are not a group will be ignored.

The where group itself is listed first (preorder), then each of its child groups (following an alphanumerical order) is also traversed, following the same procedure.

Parameters:

where (str, default "/") – Group where to start walking.

Yields:
  • path (str) – Full path to a group (without trailing ‘/’).

  • groups (list) – Names (strings) of the groups contained in path.

  • leaves (list) – Names (strings) of the pandas objects contained in path.

Return type:

Iterator[tuple[str, list[str], list[str]]]
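
Examples

A sketch of traversing a hypothetical store and printing the full path of every pandas object in it:

>>> with pd.HDFStore("store.h5") as store:  
...     for path, groups, leaves in store.walk():  
...         for leaf in leaves:  
...             print("/".join([path, leaf]))  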

get_node(key)[source]

return the node with the key or None if it does not exist

Parameters:

key (str) –

Return type:

Node | None

get_storer(key)[source]

return the storer object for a key, raise if not in the file

Parameters:

key (str) –

Return type:

GenericFixed | Table

copy(file, mode='w', propindexes=True, keys=None, complib=None, complevel=None, fletcher32=False, overwrite=True)[source]

Copy the existing store to a new file, updating in place.

Parameters:
  • propindexes (bool, default True) – Restore indexes in copied file.

  • keys (list, optional) – List of keys to include in the copy (defaults to all).

  • overwrite (bool, default True) – Whether to overwrite (remove and replace) existing nodes in the new store.

  • mode (str) –

  • complib – Same as in HDFStore.__init__.

  • complevel (int | None) – Same as in HDFStore.__init__.

  • fletcher32 (bool) – Same as in HDFStore.__init__.

Return type:

open file handle of the new store

info()[source]

Print detailed information on the store.

Return type:

str

class pandas.Index[source]

Immutable sequence used for indexing and alignment.

The basic object storing axis labels for all pandas objects.

Changed in version 2.0.0: Index can hold all numpy numeric dtypes (except float16). Previously only int64/uint64/float64 dtypes were accepted.

Parameters:
  • data (array-like (1-dimensional)) –

  • dtype (NumPy dtype (default: object)) – If dtype is None, we find the dtype that best fits the data. If an actual dtype is provided, we coerce to that dtype if it’s safe. Otherwise, an error will be raised.

  • copy (bool) – Make a copy of input ndarray.

  • name (object) – Name to be stored in the index.

  • tupleize_cols (bool (default: True)) – When True, attempt to create a MultiIndex if possible.

Return type:

Index

See also

RangeIndex

Index implementing a monotonic integer range.

CategoricalIndex

Index of Categorical s.

MultiIndex

A multi-level, or hierarchical Index.

IntervalIndex

An Index of Interval s.

DatetimeIndex

Index of datetime64 data.

TimedeltaIndex

Index of timedelta64 data.

PeriodIndex

Index of Period data.

Notes

An Index instance can only contain hashable objects. An Index instance can not hold numpy float16 dtype.

Examples

>>> pd.Index([1, 2, 3])
Index([1, 2, 3], dtype='int64')
>>> pd.Index(list('abc'))
Index(['a', 'b', 'c'], dtype='object')
>>> pd.Index([1, 2, 3], dtype="uint8")
Index([1, 2, 3], dtype='uint8')
str

alias of StringMethods

final is_(other)[source]

More flexible, faster check like is but that works through views.

Note: this is not the same as Index.identical(), which checks that metadata is also the same.

Parameters:

other (object) – Other object to compare against.

Returns:

True if both have same underlying data, False otherwise.

Return type:

bool

See also

Index.identical

Works like Index.is_ but also checks metadata.
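
Examples

For illustration: a view shares the underlying data with the original, a copy does not:

>>> idx1 = pd.Index(['1', '2', '3'])
>>> idx1.is_(idx1.view())
True
>>> idx1.is_(idx1.copy())
False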

dtype

Return the dtype object of the underlying data.

final ravel(order='C')[source]

Return a view on self.

Return type:

Index

Parameters:

order (str) –

See also

numpy.ndarray.ravel

Return a flattened array.
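
Examples

A brief sketch; per the description above, the result is a view of the same Index:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.ravel()
Index(['a', 'b', 'c'], dtype='object')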

view(cls=None)[source]
astype(dtype, copy=True)[source]

Create an Index with values cast to dtypes.

The class of a new Index is determined by dtype. When conversion is impossible, a TypeError exception is raised.

Parameters:
  • dtype (numpy dtype or pandas type) – Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.

  • copy (bool, default True) – By default, astype always returns a newly allocated object. If copy is set to False and internal requirements on dtype are satisfied, the original data is used to create a new Index or the original Index is returned.

Returns:

Index with values cast to specified dtype.

Return type:

Index
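
Examples

For illustration, casting an integer Index to float:

>>> idx = pd.Index([1, 2, 3])
>>> idx.astype('float64')
Index([1.0, 2.0, 3.0], dtype='float64')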

take(indices, axis=0, allow_fill=True, fill_value=None, **kwargs)[source]

Return a new Index of the values selected by the indices.

For internal compatibility with numpy arrays.

Parameters:
  • indices (array-like) – Indices to be taken.

  • axis (int, optional) – The axis over which to select values, always 0.

  • allow_fill (bool, default True) –

  • fill_value (scalar, default None) – If allow_fill=True and fill_value is not None, indices specified by -1 are regarded as NA. If Index doesn’t hold NA, raise ValueError.

Returns:

An index formed of elements at the given indices. Will be the same type as self, except for RangeIndex.

Return type:

Index

See also

numpy.ndarray.take

Return an array formed from the elements of a at the given indices.
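
Examples

For illustration, selecting by position (indices may repeat):

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.take([2, 2, 0])
Index(['c', 'c', 'a'], dtype='object')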

repeat(repeats, axis=None)[source]

Repeat elements of a Index.

Returns a new Index where each element of the current Index is repeated consecutively a given number of times.

Parameters:
  • repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Index.

  • axis (None) – Must be None. Has no effect but is accepted for compatibility with numpy.

Returns:

Newly created Index with repeated elements.

Return type:

Index

See also

Series.repeat

Equivalent function for Series.

numpy.repeat

Similar method for numpy.ndarray.

Examples

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx
Index(['a', 'b', 'c'], dtype='object')
>>> idx.repeat(2)
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object')
>>> idx.repeat([1, 2, 3])
Index(['a', 'b', 'b', 'c', 'c', 'c'], dtype='object')
copy(name=None, deep=False)[source]

Make a copy of this object.

Name is set on the new object.

Parameters:
  • name (Label, optional) – Set name for new object.

  • deep (bool, default False) –

  • self (_IndexT) –

Returns:

Index refer to new object which is a copy of this object.

Return type:

Index

Notes

In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy.
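
Examples

A short sketch; the copy is independent of the original and carries the new name:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.copy(name='letters')
Index(['a', 'b', 'c'], dtype='object', name='letters')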

format(name=False, formatter=None, na_rep='NaN')[source]

Render a string representation of the Index.

Return type:

list[str]

to_flat_index()[source]

Identity method.

This is implemented for compatibility with subclass implementations when chaining.

Returns:

Caller.

Return type:

pd.Index

Parameters:

self (_IndexT) –

See also

MultiIndex.to_flat_index

Subclass implementation.
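
Examples

For a plain Index this is the identity, returning the caller itself:

>>> idx = pd.Index([1, 2, 3])
>>> idx.to_flat_index() is idx
True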

final to_series(index=None, name=None)[source]

Create a Series with both index and values equal to the index keys.

Useful with map for returning an indexer based on an index.

Parameters:
  • index (Index, optional) – Index of resulting Series. If None, defaults to original index.

  • name (str, optional) – Name of resulting Series. If None, defaults to name of original index.

Returns:

The dtype will be based on the type of the Index values.

Return type:

Series

See also

Index.to_frame

Convert an Index to a DataFrame.

Series.to_frame

Convert Series to DataFrame.

Examples

>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')

By default, the original Index and original name is reused.

>>> idx.to_series()
animal
Ant      Ant
Bear    Bear
Cow      Cow
Name: animal, dtype: object

To enforce a new Index, specify new labels to index:

>>> idx.to_series(index=[0, 1, 2])
0     Ant
1    Bear
2     Cow
Name: animal, dtype: object

To override the name of the resulting column, specify name:

>>> idx.to_series(name='zoo')
animal
Ant      Ant
Bear    Bear
Cow      Cow
Name: zoo, dtype: object
to_frame(index=True, name=_NoDefault.no_default)[source]

Create a DataFrame with a column containing the Index.

Parameters:
  • index (bool, default True) – Set the index of the returned DataFrame as the original Index.

  • name (object, defaults to index.name) – The passed name should substitute for the index name (if it has one).

Returns:

DataFrame containing the original Index data.

Return type:

DataFrame

See also

Index.to_series

Convert an Index to a Series.

Series.to_frame

Convert Series to DataFrame.

Examples

>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')
>>> idx.to_frame()
       animal
animal
Ant       Ant
Bear     Bear
Cow       Cow

By default, the original Index is reused. To enforce a new Index:

>>> idx.to_frame(index=False)
    animal
0   Ant
1  Bear
2   Cow

To override the name of the resulting column, specify name:

>>> idx.to_frame(index=False, name='zoo')
    zoo
0   Ant
1  Bear
2   Cow
property name: Hashable

Return Index or MultiIndex name.

property names: FrozenList
set_names(names, *, level=None, inplace: Literal[False] = False) _IndexT[source]
set_names(names, *, level=None, inplace: Literal[True]) None
set_names(names, *, level=None, inplace: bool = False) _IndexT | None

Set Index or MultiIndex name.

Able to set new names partially and by level.

Parameters:
  • names (label or list of label or dict-like for MultiIndex) –

    Name(s) to set.

    Changed in version 1.3.0.

  • level (int, label or list of int or label, optional) –

    If the index is a MultiIndex and names is not dict-like, level(s) to set (None for all levels). Otherwise level must be None.

    Changed in version 1.3.0.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new Index or MultiIndex.

Returns:

The same type as the caller or None if inplace=True.

Return type:

Index or None

See also

Index.rename

Able to set new names without level.

Examples

>>> idx = pd.Index([1, 2, 3, 4])
>>> idx
Index([1, 2, 3, 4], dtype='int64')
>>> idx.set_names('quarter')
Index([1, 2, 3, 4], dtype='int64', name='quarter')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'],
...                                   [2018, 2019]])
>>> idx
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           )
>>> idx = idx.set_names(['kind', 'year'])
>>> idx.set_names('species', level=0)
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['species', 'year'])

When renaming levels with a dict, levels can not be passed.

>>> idx.set_names({'kind': 'snake'})
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['snake', 'year'])
rename(name, inplace=False)[source]

Alter Index or MultiIndex name.

Able to set new names without level. Defaults to returning new index. Length of names must match number of levels in MultiIndex.

Parameters:
  • name (label or list of labels) – Name(s) to set.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new Index or MultiIndex.

Returns:

The same type as the caller or None if inplace=True.

Return type:

Index or None

See also

Index.set_names

Able to set new names partially and by level.

Examples

>>> idx = pd.Index(['A', 'C', 'A', 'B'], name='score')
>>> idx.rename('grade')
Index(['A', 'C', 'A', 'B'], dtype='object', name='grade')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'],
...                                   [2018, 2019]],
...                                   names=['kind', 'year'])
>>> idx
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['kind', 'year'])
>>> idx.rename(['species', 'year'])
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['species', 'year'])
>>> idx.rename('species')
Traceback (most recent call last):
TypeError: Must pass list-like as `names`.
property nlevels: int

Number of levels.
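
Examples

For illustration, a plain Index always has one level, while a MultiIndex has one per array:

>>> pd.Index([1, 2, 3]).nlevels
1
>>> pd.MultiIndex.from_arrays([[1, 2], [3, 4]]).nlevels
2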

sortlevel(level=None, ascending=True, sort_remaining=None)[source]

For internal compatibility with the Index API.

Sort the Index. This is for compat with MultiIndex.

Parameters:
  • ascending (bool, default True) – False to sort in descending order.

  • level, sort_remaining – Compat parameters with the MultiIndex API.

Return type:

Index

get_level_values(level)

Return an Index of values for requested level.

This is primarily useful to get an individual level of values from a MultiIndex, but is provided on Index as well for compatibility.

Parameters:

level (int or str) – It is either the integer position or the name of the level.

Returns:

Calling object, as there is only one level in the Index.

Return type:

Index

See also

MultiIndex.get_level_values

Get values for a level of a MultiIndex.

Notes

For Index, level should be 0, since there are no multiple levels.

Examples

>>> idx = pd.Index(list('abc'))
>>> idx
Index(['a', 'b', 'c'], dtype='object')

Get level values by supplying level as integer:

>>> idx.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object')
final droplevel(level=0)[source]

Return index with requested level(s) removed.

If resulting index has only 1 level left, the result will be of Index type, not MultiIndex. The original index is not modified inplace.

Parameters:

level (int, str, or list-like, default 0) – If a string is given, must be the name of a level If list-like, elements must be names or indexes of levels.

Return type:

Index or MultiIndex

Examples

>>> mi = pd.MultiIndex.from_arrays(
... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
            (2, 4, 6)],
           names=['x', 'y', 'z'])
>>> mi.droplevel()
MultiIndex([(3, 5),
            (4, 6)],
           names=['y', 'z'])
>>> mi.droplevel(2)
MultiIndex([(1, 3),
            (2, 4)],
           names=['x', 'y'])
>>> mi.droplevel('z')
MultiIndex([(1, 3),
            (2, 4)],
           names=['x', 'y'])
>>> mi.droplevel(['x', 'y'])
Index([5, 6], dtype='int64', name='z')
property is_monotonic_increasing: bool

Return a boolean indicating whether the values are equal or increasing.

Return type:

bool

See also

Index.is_monotonic_decreasing

Check if the values are equal or decreasing.

Examples

>>> pd.Index([1, 2, 3]).is_monotonic_increasing
True
>>> pd.Index([1, 2, 2]).is_monotonic_increasing
True
>>> pd.Index([1, 3, 2]).is_monotonic_increasing
False
property is_monotonic_decreasing: bool

Return a boolean indicating whether the values are equal or decreasing.

Return type:

bool

See also

Index.is_monotonic_increasing

Check if the values are equal or increasing.

Examples

>>> pd.Index([3, 2, 1]).is_monotonic_decreasing
True
>>> pd.Index([3, 2, 2]).is_monotonic_decreasing
True
>>> pd.Index([3, 1, 2]).is_monotonic_decreasing
False
is_unique

Return whether the index has unique values.

Return type:

bool

See also

Index.has_duplicates

Inverse method that checks if it has duplicate values.

Examples

>>> idx = pd.Index([1, 5, 7, 7])
>>> idx.is_unique
False
>>> idx = pd.Index([1, 5, 7])
>>> idx.is_unique
True
>>> idx = pd.Index(["Watermelon", "Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.is_unique
False
>>> idx = pd.Index(["Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.is_unique
True
property has_duplicates: bool

Check if the Index has duplicate values.

Returns:

Whether or not the Index has duplicate values.

Return type:

bool

See also

Index.is_unique

Inverse method that checks if it has unique values.

Examples

>>> idx = pd.Index([1, 5, 7, 7])
>>> idx.has_duplicates
True
>>> idx = pd.Index([1, 5, 7])
>>> idx.has_duplicates
False
>>> idx = pd.Index(["Watermelon", "Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.has_duplicates
True
>>> idx = pd.Index(["Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.has_duplicates
False
final is_boolean()[source]

Check if the Index only consists of booleans.

Deprecated since version 2.0.0: Use pandas.api.types.is_bool_dtype instead.

Returns:

Whether or not the Index only consists of booleans.

Return type:

bool

See also

is_integer

Check if the Index only consists of integers (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_object

Check if the Index is of the object dtype (deprecated).

is_categorical

Check if the Index holds categorical data.

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index([True, False, True])
>>> idx.is_boolean()  
True
>>> idx = pd.Index(["True", "False", "True"])
>>> idx.is_boolean()  
False
>>> idx = pd.Index([True, False, "True"])
>>> idx.is_boolean()  
False
final is_integer()[source]

Check if the Index only consists of integers.

Deprecated since version 2.0.0: Use pandas.api.types.is_integer_dtype instead.

Returns:

Whether or not the Index only consists of integers.

Return type:

bool

See also

is_boolean

Check if the Index only consists of booleans (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_object

Check if the Index is of the object dtype. (deprecated).

is_categorical

Check if the Index holds categorical data (deprecated).

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index([1, 2, 3, 4])
>>> idx.is_integer()  
True
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0])
>>> idx.is_integer()  
False
>>> idx = pd.Index(["Apple", "Mango", "Watermelon"])
>>> idx.is_integer()  
False
final is_floating()[source]

Check if the Index is a floating type.

Deprecated since version 2.0.0: Use pandas.api.types.is_float_dtype instead

The Index may consist of only floats, NaNs, or a mix of floats, integers, or NaNs.

Returns:

Whether or not the Index only consists of floats, NaNs, or a mix of floats, integers, or NaNs.

Return type:

bool

See also

is_boolean

Check if the Index only consists of booleans (deprecated).

is_integer

Check if the Index only consists of integers (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_object

Check if the Index is of the object dtype (deprecated).

is_categorical

Check if the Index holds categorical data (deprecated).

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0])
>>> idx.is_floating()  
True
>>> idx = pd.Index([1.0, 2.0, np.nan, 4.0])
>>> idx.is_floating()  
True
>>> idx = pd.Index([1, 2, 3, 4, np.nan])
>>> idx.is_floating()  
True
>>> idx = pd.Index([1, 2, 3, 4])
>>> idx.is_floating()  
False
final is_numeric()[source]

Check if the Index only consists of numeric data.

Deprecated since version 2.0.0: Use pandas.api.types.is_numeric_dtype instead.

Returns:

Whether or not the Index only consists of numeric data.

Return type:

bool

See also

is_boolean

Check if the Index only consists of booleans (deprecated).

is_integer

Check if the Index only consists of integers (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_object

Check if the Index is of the object dtype (deprecated).

is_categorical

Check if the Index holds categorical data (deprecated).

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0])
>>> idx.is_numeric()  
True
>>> idx = pd.Index([1, 2, 3, 4.0])
>>> idx.is_numeric()  
True
>>> idx = pd.Index([1, 2, 3, 4])
>>> idx.is_numeric()  
True
>>> idx = pd.Index([1, 2, 3, 4.0, np.nan])
>>> idx.is_numeric()  
True
>>> idx = pd.Index([1, 2, 3, 4.0, np.nan, "Apple"])
>>> idx.is_numeric()  
False
final is_object()[source]

Check if the Index is of the object dtype.

Deprecated since version 2.0.0: Use pandas.api.types.is_object_dtype instead.

Returns:

Whether or not the Index is of the object dtype.

Return type:

bool

See also

is_boolean

Check if the Index only consists of booleans (deprecated).

is_integer

Check if the Index only consists of integers (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_categorical

Check if the Index holds categorical data (deprecated).

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index(["Apple", "Mango", "Watermelon"])
>>> idx.is_object()  
True
>>> idx = pd.Index(["Apple", "Mango", 2.0])
>>> idx.is_object()  
True
>>> idx = pd.Index(["Watermelon", "Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.is_object()  
False
>>> idx = pd.Index([1.0, 2.0, 3.0, 4.0])
>>> idx.is_object()  
False
final is_categorical()[source]

Check if the Index holds categorical data.

Deprecated since version 2.0.0: Use pandas.api.types.is_categorical_dtype() instead.

Returns:

True if the Index is categorical.

Return type:

bool

See also

CategoricalIndex

Index for categorical data.

is_boolean

Check if the Index only consists of booleans (deprecated).

is_integer

Check if the Index only consists of integers (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_object

Check if the Index is of the object dtype (deprecated).

is_interval

Check if the Index holds Interval objects (deprecated).

Examples

>>> idx = pd.Index(["Watermelon", "Orange", "Apple",
...                 "Watermelon"]).astype("category")
>>> idx.is_categorical()  
True
>>> idx = pd.Index([1, 3, 5, 7])
>>> idx.is_categorical()  
False
>>> s = pd.Series(["Peter", "Victor", "Elisabeth", "Mar"])
>>> s
0        Peter
1       Victor
2    Elisabeth
3          Mar
dtype: object
>>> s.index.is_categorical()  
False
final is_interval()[source]

Check if the Index holds Interval objects.

Deprecated since version 2.0.0: Use pandas.api.types.is_interval_dtype instead.

Returns:

Whether or not the Index holds Interval objects.

Return type:

bool

See also

IntervalIndex

Index for Interval objects.

is_boolean

Check if the Index only consists of booleans (deprecated).

is_integer

Check if the Index only consists of integers (deprecated).

is_floating

Check if the Index is a floating type (deprecated).

is_numeric

Check if the Index only consists of numeric data (deprecated).

is_object

Check if the Index is of the object dtype (deprecated).

is_categorical

Check if the Index holds categorical data (deprecated).

Examples

>>> idx = pd.Index([pd.Interval(left=0, right=5),
...                 pd.Interval(left=5, right=10)])
>>> idx.is_interval()  
True
>>> idx = pd.Index([1, 3, 5, 7])
>>> idx.is_interval()  
False
final holds_integer()[source]

Whether the type is an integer type.

Deprecated since version 2.0.0: Use pandas.api.types.infer_dtype instead.

Return type:

bool

inferred_type

Return a string of the type inferred from the values.
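Examples

A small illustration (output shown for recent pandas versions):

>>> pd.Index([1, 2, 3]).inferred_type
'integer'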

hasnans

Return True if there are any NaNs.

Enables various performance speedups.

Return type:

bool
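Examples

A minimal sketch (np refers to numpy, as in the other examples here):

>>> pd.Index([1, 2, np.nan]).hasnans
True
>>> pd.Index([1, 2, 3]).hasnans
False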

final isna()[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None, numpy.NaN or pd.NaT, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings ‘’ or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

A boolean array of whether my values are NA.

Return type:

numpy.ndarray[bool]

See also

Index.notna

Boolean inverse of isna.

Index.dropna

Omit entries with missing values.

isna

Top-level isna.

Series.isna

Detect missing values in Series object.

Examples

Show which entries in a pandas.Index are NA. The result is an array.

>>> idx = pd.Index([5.2, 6.0, np.NaN])
>>> idx
Index([5.2, 6.0, nan], dtype='float64')
>>> idx.isna()
array([False, False,  True])

Empty strings are not considered NA values. None is considered an NA value.

>>> idx = pd.Index(['black', '', 'red', None])
>>> idx
Index(['black', '', 'red', None], dtype='object')
>>> idx.isna()
array([False, False, False,  True])

For datetimes, NaT (Not a Time) is considered as an NA value.

>>> idx = pd.DatetimeIndex([pd.Timestamp('1940-04-25'),
...                         pd.Timestamp(''), None, pd.NaT])
>>> idx
DatetimeIndex(['1940-04-25', 'NaT', 'NaT', 'NaT'],
              dtype='datetime64[ns]', freq=None)
>>> idx.isna()
array([False,  True,  True,  True])
isnull()

Detect missing values.

isnull is an alias of isna; see Index.isna above for the full description, return details, and examples.
final notna()[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.

Returns:

Boolean array to indicate which entries are not NA.

Return type:

numpy.ndarray[bool]

See also

Index.notnull

Alias of notna.

Index.isna

Inverse of notna.

notna

Top-level notna.

Examples

Show which entries in an Index are not NA. The result is an array.

>>> idx = pd.Index([5.2, 6.0, np.NaN])
>>> idx
Index([5.2, 6.0, nan], dtype='float64')
>>> idx.notna()
array([ True,  True, False])

Empty strings are not considered NA values. None is considered an NA value.

>>> idx = pd.Index(['black', '', 'red', None])
>>> idx
Index(['black', '', 'red', None], dtype='object')
>>> idx.notna()
array([ True,  True,  True, False])
notnull()

Detect existing (non-missing) values.

notnull is an alias of notna; see Index.notna above for the full description, return details, and examples.
fillna(value=None, downcast=None)[source]

Fill NA/NaN values with the specified value.

Parameters:
  • value (scalar) – Scalar value to use to fill holes (e.g. 0). This value cannot be a list-like.

  • downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Return type:

Index

See also

DataFrame.fillna

Fill NaN values of a DataFrame.

Series.fillna

Fill NaN values of a Series.
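
Examples

A minimal sketch; the exact Index repr may vary by pandas version:

>>> idx = pd.Index([1.0, np.nan, 3.0])
>>> idx.fillna(0)
Index([1.0, 0.0, 3.0], dtype='float64')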

dropna(how='any')[source]

Return Index without NA/NaN values.

Parameters:
  • how ({'any', 'all'}, default 'any') – If the Index is a MultiIndex, drop the value when any or all levels are NaN.

  • self (_IndexT) –

Return type:

Index
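
Examples

A minimal sketch; the exact Index repr may vary by pandas version:

>>> idx = pd.Index([1.0, np.nan, 3.0])
>>> idx.dropna()
Index([1.0, 3.0], dtype='float64')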

unique(level=None)[source]

Return unique values in the index.

Unique values are returned in order of appearance; this does NOT sort.

Parameters:
  • level (int or hashable, optional) – Only return values from specified level (for MultiIndex). If int, gets the level by integer position, else by level name.

  • self (_IndexT) –

Return type:

Index

See also

unique

Numpy array of unique values in that column.

Series.unique

Return unique values of Series object.
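
Examples

A minimal sketch showing that order of appearance is preserved:

>>> idx = pd.Index([3, 3, 1, 2, 1])
>>> idx.unique()
Index([3, 1, 2], dtype='int64')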

drop_duplicates(*, keep='first')[source]

Return Index with duplicate values removed.

Parameters:
  • keep ({‘first’, ‘last’, False}, default ‘first’) –

    • ‘first’ : Drop duplicates except for the first occurrence.

    • ‘last’ : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

  • self (_IndexT) –

Return type:

Index

See also

Series.drop_duplicates

Equivalent method on Series.

DataFrame.drop_duplicates

Equivalent method on DataFrame.

Index.duplicated

Related method on Index, indicating duplicate Index values.

Examples

Generate a pandas.Index with duplicate values.

>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'])

The keep parameter controls which duplicate values are removed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.

>>> idx.drop_duplicates(keep='first')
Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')

The value ‘last’ keeps the last occurrence for each set of duplicated entries.

>>> idx.drop_duplicates(keep='last')
Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')

The value False discards all sets of duplicated entries.

>>> idx.drop_duplicates(keep=False)
Index(['cow', 'beetle', 'hippo'], dtype='object')
duplicated(keep='first')[source]

Indicate duplicate index values.

Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.

Parameters:

keep ({'first', 'last', False}, default 'first') –

Controls which occurrences in a set of duplicates are marked True.

  • ‘first’ : Mark duplicates as True except for the first occurrence.

  • ‘last’ : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Return type:

np.ndarray[bool]

See also

Series.duplicated

Equivalent method on pandas.Series.

DataFrame.duplicated

Equivalent method on pandas.DataFrame.

Index.drop_duplicates

Remove duplicate values from Index.

Examples

By default, for each set of duplicated values, the first occurrence is set to False and all others to True:

>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> idx.duplicated()
array([False, False,  True, False,  True])

which is equivalent to

>>> idx.duplicated(keep='first')
array([False, False,  True, False,  True])

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:

>>> idx.duplicated(keep='last')
array([ True, False,  True, False, False])

By setting keep to False, all duplicates are True:

>>> idx.duplicated(keep=False)
array([ True, False,  True, False,  True])
final union(other, sort=None)[source]

Form the union of two Index objects.

If the Index objects are incompatible, both Index objects will be cast to dtype(‘object’) first.

Parameters:
  • other (Index or array-like) –

  • sort (bool or None, default None) –

    Whether to sort the resulting Index.

    • None : Sort the result, except when

      1. self and other are equal.

      2. self or other has length 0.

      3. Some values in self or other cannot be compared. A RuntimeWarning is issued in this case.

    • False : do not sort the result.

    • True : Sort the result (which may raise TypeError).

Return type:

Index

Examples

Union matching dtypes

>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.union(idx2)
Index([1, 2, 3, 4, 5, 6], dtype='int64')

Union mismatched dtypes

>>> idx1 = pd.Index(['a', 'b', 'c', 'd'])
>>> idx2 = pd.Index([1, 2, 3, 4])
>>> idx1.union(idx2)
Index(['a', 'b', 'c', 'd', 1, 2, 3, 4], dtype='object')

MultiIndex case

>>> idx1 = pd.MultiIndex.from_arrays(
...     [[1, 1, 2, 2], ["Red", "Blue", "Red", "Blue"]]
... )
>>> idx1
MultiIndex([(1,  'Red'),
            (1, 'Blue'),
            (2,  'Red'),
            (2, 'Blue')],
           )
>>> idx2 = pd.MultiIndex.from_arrays(
...     [[3, 3, 2, 2], ["Red", "Green", "Red", "Green"]]
... )
>>> idx2
MultiIndex([(3,   'Red'),
            (3, 'Green'),
            (2,   'Red'),
            (2, 'Green')],
           )
>>> idx1.union(idx2)
MultiIndex([(1,  'Blue'),
            (1,   'Red'),
            (2,  'Blue'),
            (2, 'Green'),
            (2,   'Red'),
            (3, 'Green'),
            (3,   'Red')],
           )
>>> idx1.union(idx2, sort=False)
MultiIndex([(1,   'Red'),
            (1,  'Blue'),
            (2,   'Red'),
            (2,  'Blue'),
            (3,   'Red'),
            (3, 'Green'),
            (2, 'Green')],
           )
final intersection(other, sort=False)[source]

Form the intersection of two Index objects.

This returns a new Index with elements common to the index and other.

Parameters:
  • other (Index or array-like) –

  • sort (True, False or None, default False) –

    Whether to sort the resulting index.

    • None : sort the result, except when self and other are equal or when the values cannot be compared.

    • False : do not sort the result.

    • True : Sort the result (which may raise TypeError).

Return type:

Index

Examples

>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.intersection(idx2)
Index([3, 4], dtype='int64')
final difference(other, sort=None)[source]

Return a new Index with elements of index not in other.

This is the set difference of two Index objects.

Parameters:
  • other (Index or array-like) –

  • sort (bool or None, default None) –

    Whether to sort the resulting index. By default, the values are attempted to be sorted, but any TypeError from incomparable elements is caught by pandas.

    • None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.

    • False : Do not sort the result.

    • True : Sort the result (which may raise TypeError).

Return type:

Index

Examples

>>> idx1 = pd.Index([2, 1, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.difference(idx2)
Index([1, 2], dtype='int64')
>>> idx1.difference(idx2, sort=False)
Index([2, 1], dtype='int64')
symmetric_difference(other, result_name=None, sort=None)[source]

Compute the symmetric difference of two Index objects.

Parameters:
  • other (Index or array-like) –

  • result_name (str) –

  • sort (bool or None, default None) –

    Whether to sort the resulting index. By default, the values are attempted to be sorted, but any TypeError from incomparable elements is caught by pandas.

    • None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.

    • False : Do not sort the result.

    • True : Sort the result (which may raise TypeError).

Return type:

Index

Notes

symmetric_difference contains elements that appear in either idx1 or idx2 but not both. Equivalent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with duplicates dropped.

Examples

>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([2, 3, 4, 5])
>>> idx1.symmetric_difference(idx2)
Index([1, 5], dtype='int64')
get_loc(key)[source]

Get integer location, slice or boolean mask for requested label.

Parameters:

key (label) –

Return type:

int if unique index, slice if monotonic index, else mask

Examples

>>> unique_index = pd.Index(list('abc'))
>>> unique_index.get_loc('b')
1
>>> monotonic_index = pd.Index(list('abbc'))
>>> monotonic_index.get_loc('b')
slice(1, 3, None)
>>> non_monotonic_index = pd.Index(list('abcb'))
>>> non_monotonic_index.get_loc('b')
array([False,  True, False,  True])
final get_indexer(target, method=None, limit=None, tolerance=None)[source]

Compute indexer and mask for new index given the current index.

The indexer should be then used as an input to ndarray.take to align the current data to the new index.

Parameters:
  • target (Index) –

  • method ({None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional) –

    • default: exact matches only.

    • pad / ffill: find the PREVIOUS index value if no exact match.

    • backfill / bfill: use NEXT index value if no exact match

    • nearest: use the NEAREST index value if no exact match. Tied distances are broken by preferring the larger index value.

  • limit (int, optional) – Maximum number of consecutive labels in target to match for inexact matches.

  • tolerance (optional) –

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

Returns:

Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.

Return type:

np.ndarray[np.intp]

Notes

Returns -1 for unmatched values; for further explanation, see the example below.

Examples

>>> index = pd.Index(['c', 'a', 'b'])
>>> index.get_indexer(['a', 'b', 'x'])
array([ 1,  2, -1])

Notice that the return value is an array of locations in index and x is marked by -1, as it is not in index.

reindex(target, method=None, level=None, limit=None, tolerance=None)[source]

Create index with target’s values.

Parameters:
  • target (an iterable) –

  • method ({None, 'pad'/'ffill', 'backfill'/'bfill', 'nearest'}, optional) –

    • default: exact matches only.

    • pad / ffill: find the PREVIOUS index value if no exact match.

    • backfill / bfill: use NEXT index value if no exact match

    • nearest: use the NEAREST index value if no exact match. Tied distances are broken by preferring the larger index value.

  • level (int, optional) – Level of multiindex.

  • limit (int, optional) – Maximum number of consecutive labels in target to match for inexact matches.

  • tolerance (int or float, optional) –

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

Returns:

  • new_index (pd.Index) – Resulting index.

  • indexer (np.ndarray[np.intp] or None) – Indices of output values in original index.

Raises:
  • TypeError – If method passed along with level.

  • ValueError – If non-unique multi-index

  • ValueError – If non-unique index and method or limit passed.

Return type:

tuple[Index, npt.NDArray[np.intp] | None]

See also

Series.reindex

Conform Series to new index with optional filling logic.

DataFrame.reindex

Conform DataFrame to new index with optional filling logic.

Examples

>>> idx = pd.Index(['car', 'bike', 'train', 'tractor'])
>>> idx
Index(['car', 'bike', 'train', 'tractor'], dtype='object')
>>> idx.reindex(['car', 'bike'])
(Index(['car', 'bike'], dtype='object'), array([0, 1]))
property values: ExtensionArray | ndarray

Return an array representing the data in the Index.

Warning

We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.

Returns:

array

Return type:

numpy.ndarray or ExtensionArray

See also

Index.array

Reference to the underlying data.

Index.to_numpy

A NumPy array representing the underlying data.
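
Examples

A minimal sketch (for a NumPy-backed Index the result is an ndarray):

>>> pd.Index([1, 2, 3]).values
array([1, 2, 3])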

array

The ExtensionArray of the data backing this Series or Index.

Returns:

An ExtensionArray of the values stored within. For extension types, this is the actual array. For NumPy native types, this is a thin (no copy) wrapper around numpy.ndarray.

.array differs from .values, which may require converting the data to a different form.

Return type:

ExtensionArray

See also

Index.to_numpy

Similar method that always returns a NumPy array.

Series.to_numpy

Similar method that always returns a NumPy array.

Notes

This table lays out the different array types for each extension dtype within pandas.

  • category – Categorical

  • period – PeriodArray

  • interval – IntervalArray

  • IntegerNA – IntegerArray

  • string – StringArray

  • boolean – BooleanArray

  • datetime64[ns, tz] – DatetimeArray

For any 3rd-party extension types, the array type will be an ExtensionArray.

For all remaining dtypes .array will be an arrays.NumpyExtensionArray wrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then use Series.to_numpy() instead.

Examples

For regular NumPy types like int and float, a PandasArray is returned.

>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64

For extension types, like Categorical, the actual ExtensionArray is returned.

>>> ser = pd.Series(pd.Categorical(['a', 'b', 'a']))
>>> ser.array
['a', 'b', 'a']
Categories (2, object): ['a', 'b']
memory_usage(deep=False)[source]

Memory usage of the values.

Parameters:

deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.

Return type:

int (number of bytes used)

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of the array.

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False, or if used on PyPy.
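
Examples

A minimal sketch; the exact byte count depends on the dtype and platform:

>>> idx = pd.Index([1, 2, 3])
>>> idx.memory_usage()
24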

final where(cond, other=None)[source]

Replace values where the condition is False.

The replacement is taken from other.

Parameters:
  • cond (bool array-like with the same length as self) – Condition to select the values on.

  • other (scalar, or array-like, default None) – Replacement if the condition is False.

Returns:

A copy of self with values replaced from other where the condition is False.

Return type:

pandas.Index

See also

Series.where

Same method for Series.

DataFrame.where

Same method for DataFrame.

Examples

>>> idx = pd.Index(['car', 'bike', 'train', 'tractor'])
>>> idx
Index(['car', 'bike', 'train', 'tractor'], dtype='object')
>>> idx.where(idx.isin(['car', 'train']), 'other')
Index(['car', 'other', 'train', 'other'], dtype='object')
append(other)[source]

Append a collection of Index objects together.

Parameters:

other (Index or list/tuple of indices) –

Return type:

Index
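
Examples

A minimal sketch:

>>> idx = pd.Index([1, 2, 3])
>>> idx.append(pd.Index([4, 5]))
Index([1, 2, 3, 4, 5], dtype='int64')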

putmask(mask, value)[source]

Return a new Index of the values set with the mask.

Return type:

Index

See also

numpy.ndarray.putmask

Changes elements of an array based on conditional and input values.
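
Examples

A minimal sketch replacing values where the mask is True:

>>> idx = pd.Index([1, 2, 3, 4])
>>> idx.putmask(idx > 2, 0)
Index([1, 2, 0, 0], dtype='int64')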

equals(other)[source]

Determine if two Index objects are equal.

The things that are being compared are:

  • The elements inside the Index object.

  • The order of the elements inside the Index object.

Parameters:

other (Any) – The other object to compare against.

Returns:

True if “other” is an Index and it has the same elements and order as the calling index; False otherwise.

Return type:

bool

Examples

>>> idx1 = pd.Index([1, 2, 3])
>>> idx1
Index([1, 2, 3], dtype='int64')
>>> idx1.equals(pd.Index([1, 2, 3]))
True

The elements inside are compared

>>> idx2 = pd.Index(["1", "2", "3"])
>>> idx2
Index(['1', '2', '3'], dtype='object')
>>> idx1.equals(idx2)
False

The order is compared

>>> ascending_idx = pd.Index([1, 2, 3])
>>> ascending_idx
Index([1, 2, 3], dtype='int64')
>>> descending_idx = pd.Index([3, 2, 1])
>>> descending_idx
Index([3, 2, 1], dtype='int64')
>>> ascending_idx.equals(descending_idx)
False

The dtype is not compared

>>> int64_idx = pd.Index([1, 2, 3], dtype='int64')
>>> int64_idx
Index([1, 2, 3], dtype='int64')
>>> uint64_idx = pd.Index([1, 2, 3], dtype='uint64')
>>> uint64_idx
Index([1, 2, 3], dtype='uint64')
>>> int64_idx.equals(uint64_idx)
True
final identical(other)[source]

Similar to equals, but checks that object attributes and types are also equal.

Returns:

True if the two Index objects have equal elements and the same type, otherwise False.

Return type:

bool
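
Examples

A minimal sketch; unlike equals, identical also compares attributes such as the name:

>>> idx1 = pd.Index([1, 2, 3], name='a')
>>> idx2 = pd.Index([1, 2, 3], name='b')
>>> idx1.equals(idx2)
True
>>> idx1.identical(idx2)
False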

final asof(label)[source]

Return the label from the index, or, if not present, the previous one.

Assuming that the index is sorted, return the passed index label if it is in the index, or return the previous index label if the passed one is not in the index.

Parameters:

label (object) – The label up to which the method returns the latest index label.

Returns:

The passed label if it is in the index. The previous label if the passed label is not in the sorted index, or NaN if there is no such label.

Return type:

object

See also

Series.asof

Return the latest value in a Series up to the passed index.

merge_asof

Perform an asof merge (similar to left join but it matches on nearest key rather than equal key).

Index.get_loc

An asof is a thin wrapper around get_loc with method=‘pad’.

Examples

Index.asof returns the latest index label up to the passed label.

>>> idx = pd.Index(['2013-12-31', '2014-01-02', '2014-01-03'])
>>> idx.asof('2014-01-01')
'2013-12-31'

If the label is in the index, the method returns the passed label.

>>> idx.asof('2014-01-02')
'2014-01-02'

If all of the labels in the index are later than the passed label, NaN is returned.

>>> idx.asof('1999-01-02')
nan

If the index is not sorted, an error is raised.

>>> idx_not_sorted = pd.Index(['2013-12-31', '2015-01-02',
...                            '2014-01-03'])
>>> idx_not_sorted.asof('2013-12-31')
Traceback (most recent call last):
ValueError: index must be monotonic increasing or decreasing
asof_locs(where, mask)[source]

Return the locations (indices) of labels in the index.

As in the asof function, if the label (a particular entry in where) is not in the index, the latest index label up to the passed label is chosen and its index returned.

If all of the labels in the index are later than a label in where, -1 is returned.

mask is used to ignore NA values in the index during calculation.

Parameters:
  • where (Index) – An Index consisting of an array of timestamps.

  • mask (np.ndarray[bool]) – Array of booleans denoting where values in the original data are not NA.

Returns:

An array of locations (indices) of the labels from the Index which correspond to the return values of the asof function for every element in where.

Return type:

np.ndarray[np.intp]
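
Examples

A minimal sketch (np refers to numpy; the mask marks every position as non-NA). Labels earlier than every index entry map to -1:

>>> idx = pd.DatetimeIndex(['2023-06-01', '2023-06-02', '2023-06-03'])
>>> where = pd.DatetimeIndex(['2023-05-30', '2023-06-02'])
>>> idx.asof_locs(where, np.ones(3, dtype=bool))
array([-1,  1])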

sort_values(return_indexer=False, ascending=True, na_position='last', key=None)[source]

Return a sorted copy of the index.

Return a sorted copy of the index, and optionally return the indices that sorted the index itself.

Parameters:
  • return_indexer (bool, default False) – Should the indices that would sort the index be returned.

  • ascending (bool, default True) – Should the index values be sorted in an ascending order.

  • na_position ({'first' or 'last'}, default 'last') –

    Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.

    New in version 1.2.0.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.

    New in version 1.1.0.

Returns:

  • sorted_index (pandas.Index) – Sorted copy of the index.

  • indexer (numpy.ndarray, optional) – The indices that the index itself was sorted by.

See also

Series.sort_values

Sort values of a Series.

DataFrame.sort_values

Sort values in a DataFrame.

Examples

>>> idx = pd.Index([10, 100, 1, 1000])
>>> idx
Index([10, 100, 1, 1000], dtype='int64')

Sort values in ascending order (default behavior).

>>> idx.sort_values()
Index([1, 10, 100, 1000], dtype='int64')

Sort values in descending order, and also get the indices idx was sorted by.

>>> idx.sort_values(ascending=False, return_indexer=True)
(Index([1000, 100, 10, 1], dtype='int64'), array([3, 1, 0, 2]))
final sort(*args, **kwargs)[source]

Use sort_values instead.

shift(periods=1, freq=None)[source]

Shift index by desired number of time frequency increments.

This method is for shifting the values of datetime-like indexes by a specified time increment a given number of times.

Parameters:
  • periods (int, default 1) – Number of periods (or increments) to shift by, can be positive or negative.

  • freq (pandas.DateOffset, pandas.Timedelta or str, optional) – Frequency increment to shift by. If None, the index is shifted by its own freq attribute. Offset aliases are valid strings, e.g., ‘D’, ‘W’, ‘M’ etc.

Returns:

Shifted index.

Return type:

pandas.Index

See also

Series.shift

Shift values of Series.

Notes

This method is only implemented for datetime-like index classes, i.e., DatetimeIndex, PeriodIndex and TimedeltaIndex.

Examples

Put the first 5 month starts of 2011 into an index.

>>> month_starts = pd.date_range('1/1/2011', periods=5, freq='MS')
>>> month_starts
DatetimeIndex(['2011-01-01', '2011-02-01', '2011-03-01', '2011-04-01',
               '2011-05-01'],
              dtype='datetime64[ns]', freq='MS')

Shift the index by 10 days.

>>> month_starts.shift(10, freq='D')
DatetimeIndex(['2011-01-11', '2011-02-11', '2011-03-11', '2011-04-11',
               '2011-05-11'],
              dtype='datetime64[ns]', freq=None)

The default value of freq is the freq attribute of the index, which is ‘MS’ (month start) in this example.

>>> month_starts.shift(10)
DatetimeIndex(['2011-11-01', '2011-12-01', '2012-01-01', '2012-02-01',
               '2012-03-01'],
              dtype='datetime64[ns]', freq='MS')
argsort(*args, **kwargs)[source]

Return the integer indices that would sort the index.

Parameters:
  • *args – Passed to numpy.ndarray.argsort.

  • **kwargs – Passed to numpy.ndarray.argsort.

Returns:

Integer indices that would sort the index if used as an indexer.

Return type:

np.ndarray[np.intp]

See also

numpy.argsort

Similar method for NumPy arrays.

Index.sort_values

Return sorted copy of Index.

Examples

>>> idx = pd.Index(['b', 'a', 'd', 'c'])
>>> idx
Index(['b', 'a', 'd', 'c'], dtype='object')
>>> order = idx.argsort()
>>> order
array([1, 0, 3, 2])
>>> idx[order]
Index(['a', 'b', 'c', 'd'], dtype='object')
get_indexer_non_unique(target)[source]

Compute indexer and mask for new index given the current index.

The indexer should be then used as an input to ndarray.take to align the current data to the new index.

Parameters:

target (Index) –

Returns:

  • indexer (np.ndarray[np.intp]) – Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.

  • missing (np.ndarray[np.intp]) – An indexer into the target of the values not found. These correspond to the -1 in the indexer array.

Return type:

tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]

Examples

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['b', 'b'])
(array([1, 3, 4, 1, 3, 4]), array([], dtype=int64))

In the example below there are no matched values.

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['q', 'r', 't'])
(array([-1, -1, -1]), array([0, 1, 2]))

Here the returned indexer contains only integers equal to -1, demonstrating that there is no match between the index and the target values at these positions. The mask [0, 1, 2] in the return value shows that the first, second, and third target elements are missing.

Notice that the return value is a tuple containing two items. In the example below, the first item is an array of locations in index; the second item is a mask showing that the first and third target elements are missing.

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['f', 'b', 's'])
(array([-1,  1,  3,  4, -1]), array([0, 2]))
final get_indexer_for(target)[source]

Guaranteed return of an indexer even when non-unique.

This dispatches to get_indexer or get_indexer_non_unique as appropriate.

Returns:

List of indices.

Return type:

np.ndarray[np.intp]

Examples

>>> idx = pd.Index([np.nan, 'var1', np.nan])
>>> idx.get_indexer_for([np.nan])
array([0, 2])
final groupby(values)[source]

Group the index labels by a given array of values.

Parameters:

values (array) – Values used to determine the groups.

Returns:

{group name -> group labels}

Return type:

dict
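
Examples

A minimal sketch (np refers to numpy); the result maps each group value to the Index labels in that group, and the dict formatting may vary:

>>> idx = pd.Index(['a', 'b', 'c', 'd'])
>>> idx.groupby(np.array([1, 1, 2, 2]))
{1: Index(['a', 'b'], dtype='object'), 2: Index(['c', 'd'], dtype='object')}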

map(mapper, na_action=None)[source]

Map values using an input mapping or function.

Parameters:
  • mapper (function, dict, or Series) – Mapping correspondence.

  • na_action ({None, 'ignore'}) – If ‘ignore’, propagate NA values, without passing them to the mapping correspondence.

Returns:

The output of the mapping function applied to the index. If the function returns a tuple with more than one element a MultiIndex will be returned.

Return type:

Union[Index, MultiIndex]
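
Examples

A minimal sketch using a dict as the mapping correspondence:

>>> idx = pd.Index([1, 2, 3])
>>> idx.map({1: 'a', 2: 'b', 3: 'c'})
Index(['a', 'b', 'c'], dtype='object')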

isin(values, level=None)[source]

Return a boolean array where the index values are in values.

Compute boolean array of whether each index value is found in the passed set of values. The length of the returned boolean array matches the length of the index.

Parameters:
  • values (set or list-like) – Sought values.

  • level (str or int, optional) – Name or position of the index level to use (if the index is a MultiIndex).

Returns:

NumPy array of boolean values.

Return type:

np.ndarray[bool]

See also

Series.isin

Same for Series.

DataFrame.isin

Same method for DataFrames.

Notes

In the case of MultiIndex you must either specify values as a list-like object containing tuples that are the same length as the number of levels, or specify level. Otherwise it will raise a ValueError.

If level is specified:

  • if it is the name of one and only one index level, use that level;

  • otherwise it should be a number indicating level position.

Examples

>>> idx = pd.Index([1,2,3])
>>> idx
Index([1, 2, 3], dtype='int64')

Check whether each index value is in a list of values.

>>> idx.isin([1, 4])
array([ True, False, False])
>>> midx = pd.MultiIndex.from_arrays([[1,2,3],
...                                  ['red', 'blue', 'green']],
...                                  names=('number', 'color'))
>>> midx
MultiIndex([(1,   'red'),
            (2,  'blue'),
            (3, 'green')],
           names=['number', 'color'])

Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.

>>> midx.isin(['red', 'orange', 'yellow'], level='color')
array([ True, False, False])

To check across the levels of a MultiIndex, pass a list of tuples:

>>> midx.isin([(1, 'red'), (3, 'red')])
array([ True, False, False])

For a DatetimeIndex, string values in values are converted to Timestamps.

>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13']
>>> dti = pd.to_datetime(dates)
>>> dti
DatetimeIndex(['2000-03-11', '2000-03-12', '2000-03-13'],
dtype='datetime64[ns]', freq=None)
>>> dti.isin(['2000-03-11'])
array([ True, False, False])
slice_indexer(start=None, end=None, step=None)[source]

Compute the slice indexer for input labels and step.

Index needs to be ordered and unique.

Parameters:
  • start (label, default None) – If None, defaults to the beginning.

  • end (label, default None) – If None, defaults to the end.

  • step (int, default None) –

Return type:

slice

Raises:

KeyError – If key does not exist, or if key is not unique and the index is not ordered.

Notes

This function assumes that the data is sorted, so use at your own peril

Examples

This is a method on all index types. For example you can do:

>>> idx = pd.Index(list('abcd'))
>>> idx.slice_indexer(start='b', end='c')
slice(1, 3, None)
>>> idx = pd.MultiIndex.from_arrays([list('abcd'), list('efgh')])
>>> idx.slice_indexer(start='b', end=('c', 'g'))
slice(1, 3, None)
get_slice_bound(label, side)[source]

Calculate slice bound that corresponds to given label.

Returns leftmost (one-past-the-rightmost if side=='right') position of given label.

Parameters:
  • label (object) –

  • side ({'left', 'right'}) –

Returns:

Index of label.

Return type:

int
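
Examples

A minimal sketch on a sorted, unique Index:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.get_slice_bound('b', side='left')
1
>>> idx.get_slice_bound('b', side='right')
2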

slice_locs(start=None, end=None, step=None)[source]

Compute slice locations for input labels.

Parameters:
  • start (label, default None) – If None, defaults to the beginning.

  • end (label, default None) – If None, defaults to the end.

  • step (int, defaults None) – If None, defaults to 1.

Return type:

tuple[int, int]

See also

Index.get_loc

Get location for a single label.

Notes

This method only works if the index is monotonic or unique.

Examples

>>> idx = pd.Index(list('abcd'))
>>> idx.slice_locs(start='b', end='c')
(1, 3)
delete(loc)[source]

Make new Index with passed location(-s) deleted.

Parameters:
  • loc (int or list of int) – Location of item(-s) which will be deleted. Use a list of locations to delete more than one value at the same time.

  • self (_IndexT) –

Returns:

Will be same type as self, except for RangeIndex.

Return type:

Index

See also

numpy.delete

Delete rows and columns from a NumPy array (ndarray).

Examples

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete(1)
Index(['a', 'c'], dtype='object')
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete([0, 2])
Index(['b'], dtype='object')
insert(loc, item)[source]

Make new Index inserting new item at location.

Follows Python numpy.insert semantics for negative values.

Parameters:
  • loc (int) – Location at which the new item will be inserted.

  • item (object) – The value to insert.
Return type:

Index
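
Examples

A minimal sketch:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.insert(1, 'x')
Index(['a', 'x', 'b', 'c'], dtype='object')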

drop(labels, errors='raise')[source]

Make new Index with passed list of labels deleted.

Parameters:
  • labels (array-like or scalar) –

  • errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and existing labels are dropped.

Returns:

Will be same type as self, except for RangeIndex.

Return type:

Index

Raises:

KeyError – If not all of the labels are found in the selected axis
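
Examples

A minimal sketch:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.drop(['a'])
Index(['b', 'c'], dtype='object')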

infer_objects(copy=True)[source]

If we have an object dtype, try to infer a non-object dtype.

Parameters:

copy (bool, default True) – Whether to make a copy in cases where no inference occurs.

Return type:

Index
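
Examples

A minimal sketch; integers stored under object dtype are inferred back to int64:

>>> idx = pd.Index([1, 2, 3], dtype='object')
>>> idx.infer_objects()
Index([1, 2, 3], dtype='int64')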

any(*args, **kwargs)[source]

Return whether any element is Truthy.

Parameters:
  • *args – Required for compatibility with numpy.

  • **kwargs – Required for compatibility with numpy.

Returns:

A single element array-like may be converted to bool.

Return type:

bool or array-like (if axis is specified)

See also

Index.all

Return whether all elements are True.

Series.all

Return whether all elements are True.

Notes

Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.

Examples

>>> index = pd.Index([0, 1, 2])
>>> index.any()
True
>>> index = pd.Index([0, 0, 0])
>>> index.any()
False
all(*args, **kwargs)[source]

Return whether all elements are Truthy.

Parameters:
  • *args – Required for compatibility with numpy.

  • **kwargs – Required for compatibility with numpy.

Returns:

A single element array-like may be converted to bool.

Return type:

bool or array-like (if axis is specified)

See also

Index.any

Return whether any element in an Index is True.

Series.any

Return whether any element in a Series is True.

Series.all

Return whether all elements in a Series are True.

Notes

Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.

Examples

True, because nonzero integers are considered True.

>>> pd.Index([1, 2, 3]).all()
True

False, because 0 is considered False.

>>> pd.Index([0, 1, 2]).all()
False
argmin(axis=None, skipna=True, *args, **kwargs)[source]

Return int position of the smallest value in the Series.

If the minimum is achieved in multiple locations, the first row position is returned.

Parameters:
  • axis ({None}) – Unused. Parameter needed for compatibility with DataFrame.

  • skipna (bool, default True) – Exclude NA/null values when showing the result.

  • *args – Additional arguments and keywords for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords for compatibility with NumPy.

Returns:

Row position of the minimum value.

Return type:

int

See also

Series.argmin

Return position of the minimum value.

Series.argmax

Return position of the maximum value.

numpy.ndarray.argmin

Equivalent method for numpy arrays.

Series.idxmax

Return index label of the maximum values.

Series.idxmin

Return index label of the minimum values.

Examples

Consider a dataset containing cereal calories.

>>> s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
...                'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
>>> s
Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64
>>> s.argmax()
2
>>> s.argmin()
0

The maximum cereal calories value is the third element and the minimum is the first element, since the series is zero-indexed.

argmax(axis=None, skipna=True, *args, **kwargs)[source]

Return int position of the largest value in the Series.

If the maximum is achieved in multiple locations, the first row position is returned.

Parameters:
  • axis ({None}) – Unused. Parameter needed for compatibility with DataFrame.

  • skipna (bool, default True) – Exclude NA/null values when showing the result.

  • *args – Additional arguments and keywords for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords for compatibility with NumPy.

Returns:

Row position of the maximum value.

Return type:

int

See also

Series.argmax

Return position of the maximum value.

Series.argmin

Return position of the minimum value.

numpy.ndarray.argmax

Equivalent method for numpy arrays.

Series.idxmax

Return index label of the maximum values.

Series.idxmin

Return index label of the minimum values.

Examples

Consider a dataset containing cereal calories.

>>> s = pd.Series({'Corn Flakes': 100.0, 'Almond Delight': 110.0,
...                'Cinnamon Toast Crunch': 120.0, 'Cocoa Puff': 110.0})
>>> s
Corn Flakes              100.0
Almond Delight           110.0
Cinnamon Toast Crunch    120.0
Cocoa Puff               110.0
dtype: float64
>>> s.argmax()
2
>>> s.argmin()
0

The maximum cereal calories value is the third element and the minimum is the first element, since the series is zero-indexed.

min(axis=None, skipna=True, *args, **kwargs)[source]

Return the minimum value of the Index.

Parameters:
  • axis ({None}) – Dummy argument for consistency with Series.

  • skipna (bool, default True) – Exclude NA/null values when showing the result.

  • *args – Additional arguments and keywords for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords for compatibility with NumPy.

Returns:

Minimum value.

Return type:

scalar

See also

Index.max

Return the maximum value of the object.

Series.min

Return the minimum value in a Series.

DataFrame.min

Return the minimum values in a DataFrame.

Examples

>>> idx = pd.Index([3, 2, 1])
>>> idx.min()
1
>>> idx = pd.Index(['c', 'b', 'a'])
>>> idx.min()
'a'

For a MultiIndex, the minimum is determined lexicographically.

>>> idx = pd.MultiIndex.from_product([('a', 'b'), (2, 1)])
>>> idx.min()
('a', 1)
max(axis=None, skipna=True, *args, **kwargs)[source]

Return the maximum value of the Index.

Parameters:
  • axis (int, optional) – For compatibility with NumPy. Only 0 or None are allowed.

  • skipna (bool, default True) – Exclude NA/null values when showing the result.

  • *args – Additional arguments and keywords for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords for compatibility with NumPy.

Returns:

Maximum value.

Return type:

scalar

See also

Index.min

Return the minimum value in an Index.

Series.max

Return the maximum value in a Series.

DataFrame.max

Return the maximum values in a DataFrame.

Examples

>>> idx = pd.Index([3, 2, 1])
>>> idx.max()
3
>>> idx = pd.Index(['c', 'b', 'a'])
>>> idx.max()
'c'

For a MultiIndex, the maximum is determined lexicographically.

>>> idx = pd.MultiIndex.from_product([('a', 'b'), (2, 1)])
>>> idx.max()
('b', 2)
property shape: Tuple[int, ...]

Return a tuple of the shape of the underlying data.
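
Examples

An Index is always one-dimensional:

>>> pd.Index([1, 2, 3]).shape
(3,)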

class pandas.Int16Dtype[source]

An ExtensionDtype for int16 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of int16

name: str = 'Int16'
class pandas.Int32Dtype[source]

An ExtensionDtype for int32 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of int32

name: str = 'Int32'
class pandas.Int64Dtype[source]

An ExtensionDtype for int64 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of int64
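
Examples

A minimal sketch constructing a nullable integer array (repr shown for recent pandas):

>>> pd.array([1, 2, None], dtype=pd.Int64Dtype())
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64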

name: str = 'Int64'
class pandas.Int8Dtype[source]

An ExtensionDtype for int8 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of int8

name: str = 'Int8'
class pandas.Interval

Immutable object implementing an Interval, a bounded slice-like interval.

Parameters:
  • left (orderable scalar) – Left bound for the interval.

  • right (orderable scalar) – Right bound for the interval.

  • closed ({'right', 'left', 'both', 'neither'}, default 'right') – Whether the interval is closed on the left-side, right-side, both or neither. See the Notes for more detailed explanation.

See also

IntervalIndex

An Index of Interval objects that are all closed on the same side.

cut

Convert continuous data into discrete bins (Categorical of Interval objects).

qcut

Convert continuous data into bins (Categorical of Interval objects) based on quantiles.

Period

Represents a period of time.

Notes

The parameters left and right must be of the same type; you must be able to compare them, and they must satisfy left <= right.

A closed interval (in mathematics denoted by square brackets) contains its endpoints, i.e. the closed interval [0, 5] is characterized by the conditions 0 <= x <= 5. This is what closed='both' stands for. An open interval (in mathematics denoted by parentheses) does not contain its endpoints, i.e. the open interval (0, 5) is characterized by the conditions 0 < x < 5. This is what closed='neither' stands for. Intervals can also be half-open or half-closed, i.e. [0, 5) is described by 0 <= x < 5 (closed='left') and (0, 5] is described by 0 < x <= 5 (closed='right').

Examples

It is possible to build Intervals of different types, like numeric ones:

>>> iv = pd.Interval(left=0, right=5)
>>> iv
Interval(0, 5, closed='right')

You can check if an element belongs to it, or if it contains another interval:

>>> 2.5 in iv
True
>>> pd.Interval(left=2, right=5, closed='both') in iv
True

You can test the bounds (closed='right', so 0 < x <= 5):

>>> 0 in iv
False
>>> 5 in iv
True
>>> 0.0001 in iv
True

Calculate its length

>>> iv.length
5

You can operate with + and * over an Interval, and the operation is applied to each of its bounds, so the result depends on the type of the bound elements.

>>> shifted_iv = iv + 3
>>> shifted_iv
Interval(3, 8, closed='right')
>>> extended_iv = iv * 10.0
>>> extended_iv
Interval(0.0, 50.0, closed='right')

To create a time interval you can use Timestamps as the bounds.

>>> year_2017 = pd.Interval(pd.Timestamp('2017-01-01 00:00:00'),
...                         pd.Timestamp('2018-01-01 00:00:00'),
...                         closed='left')
>>> pd.Timestamp('2017-01-01 00:00') in year_2017
True
>>> year_2017.length
Timedelta('365 days 00:00:00')
closed

String describing the inclusive side of the interval.

Either left, right, both or neither.

left

Left bound for the interval.

overlaps()

Check whether two Interval objects overlap.

Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.

Parameters:

other (Interval) – Interval to check against for an overlap.

Returns:

True if the two intervals overlap.

Return type:

bool

See also

IntervalArray.overlaps

The corresponding method for IntervalArray.

IntervalIndex.overlaps

The corresponding method for IntervalIndex.

Examples

>>> i1 = pd.Interval(0, 2)
>>> i2 = pd.Interval(1, 3)
>>> i1.overlaps(i2)
True
>>> i3 = pd.Interval(4, 5)
>>> i1.overlaps(i3)
False

Intervals that share closed endpoints overlap:

>>> i4 = pd.Interval(0, 1, closed='both')
>>> i5 = pd.Interval(1, 2, closed='both')
>>> i4.overlaps(i5)
True

Intervals that only have an open endpoint in common do not overlap:

>>> i6 = pd.Interval(1, 2, closed='neither')
>>> i4.overlaps(i6)
False
right

Right bound for the interval.

class pandas.IntervalDtype[source]

An ExtensionDtype for Interval data.

This is not an actual numpy dtype, but a duck type.

Parameters:
  • subtype (str, np.dtype) – The dtype of the Interval bounds.

  • closed (str_type | None) –


Examples

>>> pd.IntervalDtype(subtype='int64', closed='both')
interval[int64, both]
name = 'interval'
kind: str = 'O'
str: str = '|O08'
base: dtype | ExtensionDtype | None = dtype('O')
num = 103
property closed
property subtype

The dtype of the Interval bounds.

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

classmethod construct_from_string(string)[source]

Attempt to construct this type from a string; raise a TypeError if it is not possible.

Parameters:

string (str) –

Return type:

IntervalDtype
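
Examples

A minimal sketch; the string must follow the format shown by the dtype's repr:

>>> pd.IntervalDtype.construct_from_string("interval[int64, right]")
interval[int64, right]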

property type: type[pandas._libs.interval.Interval]

The scalar type for the array, e.g. int

It is expected that ExtensionArray[item] returns an instance of ExtensionDtype.type for a scalar item, assuming that the value is valid (not NA). NA values do not need to be instances of type.

classmethod is_dtype(dtype)[source]

Return a boolean indicating whether the passed type is an actual dtype that we can match (via string or type).

Parameters:

dtype (object) –

Return type:

bool

class pandas.IntervalIndex[source]

Immutable index of intervals that are closed on the same side.

New in version 0.20.0.

Parameters:
  • data (array-like (1-dimensional)) – Array-like (ndarray, DatetimeArray, TimedeltaArray) containing Interval objects from which to build the IntervalIndex.

  • closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.

  • dtype (dtype or None, default None) – If None, dtype will be inferred.

  • copy (bool, default False) – Copy the input data.

  • name (object, optional) – Name to be stored in the index.

  • verify_integrity (bool, default True) – Verify that the IntervalIndex is valid.

Return type:

IntervalIndex

Attributes

left
right
closed (IntervalClosedType)
mid
length
is_empty
is_non_overlapping_monotonic (bool)
is_overlapping
values

Methods

from_arrays()
from_tuples()
from_breaks()
contains()
overlaps()
set_closed()
to_tuples()

The from_arrays, from_tuples, and from_breaks constructors are documented in full below.

See also

Index

The base pandas Index type.

Interval

A bounded slice-like interval; the elements of an IntervalIndex.

interval_range

Function to create a fixed frequency IntervalIndex.

cut

Bin values into discrete Intervals.

qcut

Bin values into equal-sized Intervals based on rank or sample quantiles.

Notes

See the user guide for more.

Examples

A new IntervalIndex is typically constructed using interval_range():

>>> pd.interval_range(start=0, end=5)
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              dtype='interval[int64, right]')

It may also be constructed using one of the constructor methods: IntervalIndex.from_arrays(), IntervalIndex.from_breaks(), and IntervalIndex.from_tuples().

See further examples in the doc strings of interval_range and the mentioned constructor methods.

closed: IntervalClosedType

String describing the inclusive side of the intervals.

Either left, right, both or neither.

is_non_overlapping_monotonic: bool

Return a boolean indicating whether the IntervalArray is non-overlapping and monotonic.

Non-overlapping means that no Intervals share points, and monotonic means either monotonic increasing or monotonic decreasing.

property closed_left

Check if the interval is closed on the left side.

For the meaning of closed and open see Interval.

Returns:

True if the Interval is closed on the left-side.

Return type:

bool

See also

Interval.closed_right

Check if the interval is closed on the right side.

Interval.open_left

Boolean inverse of closed_left.

Examples

>>> iv = pd.Interval(0, 5, closed='left')
>>> iv.closed_left
True
>>> iv = pd.Interval(0, 5, closed='right')
>>> iv.closed_left
False
property closed_right

Check if the interval is closed on the right side.

For the meaning of closed and open see Interval.

Returns:

True if the Interval is closed on the right-side.

Return type:

bool

See also

Interval.closed_left

Check if the interval is closed on the left side.

Interval.open_right

Boolean inverse of closed_right.

Examples

>>> iv = pd.Interval(0, 5, closed='both')
>>> iv.closed_right
True
>>> iv = pd.Interval(0, 5, closed='left')
>>> iv.closed_right
False
property open_left

Check if the interval is open on the left side.

For the meaning of closed and open see Interval.

Returns:

True if the Interval is not closed on the left-side.

Return type:

bool

See also

Interval.open_right

Check if the interval is open on the right side.

Interval.closed_left

Boolean inverse of open_left.

Examples

>>> iv = pd.Interval(0, 5, closed='neither')
>>> iv.open_left
True
>>> iv = pd.Interval(0, 5, closed='both')
>>> iv.open_left
False
property open_right

Check if the interval is open on the right side.

For the meaning of closed and open see Interval.

Returns:

True if the Interval is not closed on the right-side.

Return type:

bool

See also

Interval.open_left

Check if the interval is open on the left side.

Interval.closed_right

Boolean inverse of open_right.

Examples

>>> iv = pd.Interval(0, 5, closed='left')
>>> iv.open_right
True
>>> iv = pd.Interval(0, 5)
>>> iv.open_right
False
classmethod from_breaks(breaks, closed='right', name=None, copy=False, dtype=None)[source]

Construct an IntervalIndex from an array of splits.

Parameters:
  • breaks (array-like (1-dimensional)) – Left and right bounds for each interval.

  • closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.

  • name (str, optional) – Name of the resulting IntervalIndex.

  • copy (bool, default False) – Copy the data.

  • dtype (dtype or None, default None) – If None, dtype will be inferred.

Return type:

IntervalIndex

See also

interval_range

Function to create a fixed frequency IntervalIndex.

IntervalIndex.from_arrays

Construct from a left and right array.

IntervalIndex.from_tuples

Construct from a sequence of tuples.

Examples

>>> pd.IntervalIndex.from_breaks([0, 1, 2, 3])
IntervalIndex([(0, 1], (1, 2], (2, 3]],
              dtype='interval[int64, right]')
classmethod from_arrays(left, right, closed='right', name=None, copy=False, dtype=None)[source]

Construct from two arrays defining the left and right bounds.

Parameters:
  • left (array-like (1-dimensional)) – Left bounds for each interval.

  • right (array-like (1-dimensional)) – Right bounds for each interval.

  • closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.

  • name (str, optional) – Name of the resulting IntervalIndex.

  • copy (bool, default False) – Copy the data.

  • dtype (dtype, optional) – If None, dtype will be inferred.

Return type:

IntervalIndex

Raises:

ValueError – When a value is missing in only one of left or right. When a value in left is greater than the corresponding value in right.

See also

interval_range

Function to create a fixed frequency IntervalIndex.

IntervalIndex.from_breaks

Construct an IntervalIndex from an array of splits.

IntervalIndex.from_tuples

Construct an IntervalIndex from an array-like of tuples.

Notes

Each element of left must be less than or equal to the right element at the same position. If an element is missing, it must be missing in both left and right. A TypeError is raised when using an unsupported type for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes are not supported.

Examples

>>> pd.IntervalIndex.from_arrays([0, 1, 2], [1, 2, 3])
IntervalIndex([(0, 1], (1, 2], (2, 3]],
              dtype='interval[int64, right]')
classmethod from_tuples(data, closed='right', name=None, copy=False, dtype=None)[source]

Construct an IntervalIndex from an array-like of tuples.

Parameters:
  • data (array-like (1-dimensional)) – Array of tuples.

  • closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.

  • name (str, optional) – Name of the resulting IntervalIndex.

  • copy (bool, default False) – Copy the data. This argument is kept for compatibility only and is ignored.

  • dtype (dtype or None, default None) – If None, dtype will be inferred.

Return type:

IntervalIndex

See also

interval_range

Function to create a fixed frequency IntervalIndex.

IntervalIndex.from_arrays

Construct an IntervalIndex from a left and right array.

IntervalIndex.from_breaks

Construct an IntervalIndex from an array of splits.

Examples

>>> pd.IntervalIndex.from_tuples([(0, 1), (1, 2)])
IntervalIndex([(0, 1], (1, 2]],
               dtype='interval[int64, right]')
property inferred_type: str

Return a string of the type inferred from the values.

memory_usage(deep=False)[source]

Memory usage of the values.

Parameters:

deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.

Return type:

bytes used

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of the array.

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False or if used on PyPy.
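
A minimal usage sketch; the exact byte count depends on the platform and pandas version, so no output is asserted here.

>>> idx = pd.IntervalIndex.from_breaks([0, 1, 2, 3])
>>> usage = idx.memory_usage(deep=True)  # bytes used, introspecting object dtypes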

is_monotonic_decreasing

Return True if the IntervalIndex is monotonic decreasing (only equal or decreasing values), else False.

is_unique

Return True if the IntervalIndex contains unique elements, else False.
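
For example, intervals are compared by their (left, right) bounds:

>>> pd.IntervalIndex.from_tuples([(2, 3), (1, 2), (0, 1)]).is_monotonic_decreasing
True
>>> pd.IntervalIndex.from_tuples([(0, 1), (0, 1)]).is_unique
False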

property is_overlapping: bool

Return True if the IntervalIndex has overlapping intervals, else False.

Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.

Returns:

Boolean indicating if the IntervalIndex has overlapping intervals.

Return type:

bool

See also

Interval.overlaps

Check whether two Interval objects overlap.

IntervalIndex.overlaps

Check an IntervalIndex elementwise for overlaps.

Examples

>>> index = pd.IntervalIndex.from_tuples([(0, 2), (1, 3), (4, 5)])
>>> index
IntervalIndex([(0, 2], (1, 3], (4, 5]],
      dtype='interval[int64, right]')
>>> index.is_overlapping
True

Intervals that share closed endpoints overlap:

>>> index = pd.interval_range(0, 3, closed='both')
>>> index
IntervalIndex([[0, 1], [1, 2], [2, 3]],
      dtype='interval[int64, both]')
>>> index.is_overlapping
True

Intervals that only have an open endpoint in common do not overlap:

>>> index = pd.interval_range(0, 3, closed='left')
>>> index
IntervalIndex([[0, 1), [1, 2), [2, 3)],
      dtype='interval[int64, left]')
>>> index.is_overlapping
False
get_loc(key)[source]

Get integer location, slice or boolean mask for requested label.

Parameters:

key (label) –

Return type:

int if unique index, slice if monotonic index, else mask

Examples

>>> i1, i2 = pd.Interval(0, 1), pd.Interval(1, 2)
>>> index = pd.IntervalIndex([i1, i2])
>>> index.get_loc(1)
0

You can also supply a point inside an interval.

>>> index.get_loc(1.5)
1

If a label is in several intervals, you get the locations of all the relevant intervals.

>>> i3 = pd.Interval(0, 2)
>>> overlapping_index = pd.IntervalIndex([i1, i2, i3])
>>> overlapping_index.get_loc(0.5)
array([ True, False,  True])

Only exact matches will be returned if an interval is provided.

>>> index.get_loc(pd.Interval(0, 1))
0
get_indexer_non_unique(target)[source]

Compute indexer and mask for new index given the current index.

The indexer should then be used as an input to ndarray.take to align the current data to the new index.

Parameters:

target (IntervalIndex or list of Intervals) –

Returns:

  • indexer (np.ndarray[np.intp]) – Integers from 0 to n - 1 indicating that the index at these positions matches the corresponding target values. Missing values in the target are marked by -1.

  • missing (np.ndarray[np.intp]) – An indexer into the target of the values not found. These correspond to the -1 in the indexer array.

Return type:

tuple[npt.NDArray[np.intp], npt.NDArray[np.intp]]

Examples

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['b', 'b'])
(array([1, 3, 4, 1, 3, 4]), array([], dtype=int64))

In the example below there are no matched values.

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['q', 'r', 't'])
(array([-1, -1, -1]), array([0, 1, 2]))

Because none of the target values are found, the returned indexer contains only integers equal to -1, indicating that there is no match between the index and the target values at these positions. The mask [0, 1, 2] in the return value shows that the first, second, and third target elements are missing.

Notice that the return value is a tuple containing two items. In the example below, the first item is an array of locations in index. The second item is a mask showing that the first and third elements are missing.

>>> index = pd.Index(['c', 'b', 'a', 'b', 'b'])
>>> index.get_indexer_non_unique(['f', 'b', 's'])
(array([-1,  1,  3,  4, -1]), array([0, 2]))
left
right
mid
property length: Index
contains(*args, **kwargs)

Check elementwise if the Intervals contain the value.

Return a boolean mask whether the value is contained in the Intervals of the IntervalArray.

Parameters:

other (scalar) – The value to check whether it is contained in the Intervals.

Return type:

boolean array

See also

Interval.contains

Check whether Interval object contains value.

IntervalArray.overlaps

Check if an Interval overlaps the values in the IntervalArray.

Examples

>>> intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4)])
>>> intervals
<IntervalArray>
[(0, 1], (1, 3], (2, 4]]
Length: 3, dtype: interval[int64, right]
>>> intervals.contains(0.5)
array([ True, False, False])
property is_empty

Indicates if an interval is empty, meaning it contains no points.

Returns:

A boolean indicating if a scalar Interval is empty, or a boolean ndarray positionally indicating if an Interval in an IntervalArray or IntervalIndex is empty.

Return type:

bool or ndarray

See also

Interval.length

Return the length of the Interval.

Examples

An Interval that contains points is not empty:

>>> pd.Interval(0, 1, closed='right').is_empty
False

An Interval that does not contain any points is empty:

>>> pd.Interval(0, 0, closed='right').is_empty
True
>>> pd.Interval(0, 0, closed='left').is_empty
True
>>> pd.Interval(0, 0, closed='neither').is_empty
True

An Interval that contains a single point is not empty:

>>> pd.Interval(0, 0, closed='both').is_empty
False

An IntervalArray or IntervalIndex returns a boolean ndarray positionally indicating if an Interval is empty:

>>> ivs = [pd.Interval(0, 0, closed='neither'),
...        pd.Interval(1, 2, closed='neither')]
>>> pd.arrays.IntervalArray(ivs).is_empty
array([ True, False])

Missing values are not considered empty:

>>> ivs = [pd.Interval(0, 0, closed='neither'), np.nan]
>>> pd.IntervalIndex(ivs).is_empty
array([ True, False])
overlaps(*args, **kwargs)

Check elementwise if an Interval overlaps the values in the IntervalArray.

Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have an open endpoint in common do not overlap.

Parameters:

other (IntervalArray) – Interval to check against for an overlap.

Returns:

Boolean array positionally indicating where an overlap occurs.

Return type:

ndarray

See also

Interval.overlaps

Check whether two Interval objects overlap.

Examples

>>> data = [(0, 1), (1, 3), (2, 4)]
>>> intervals = pd.arrays.IntervalArray.from_tuples(data)
>>> intervals
<IntervalArray>
[(0, 1], (1, 3], (2, 4]]
Length: 3, dtype: interval[int64, right]
>>> intervals.overlaps(pd.Interval(0.5, 1.5))
array([ True,  True, False])

Intervals that share closed endpoints overlap:

>>> intervals.overlaps(pd.Interval(1, 3, closed='left'))
array([ True,  True,  True])

Intervals that only have an open endpoint in common do not overlap:

>>> intervals.overlaps(pd.Interval(1, 2, closed='right'))
array([False,  True, False])
set_closed(*args, **kwargs)

Return an identical IntervalArray closed on the specified side.

Parameters:

closed ({'left', 'right', 'both', 'neither'}) – Whether the intervals are closed on the left-side, right-side, both or neither.

Return type:

IntervalArray

Examples

>>> index = pd.arrays.IntervalArray.from_breaks(range(4))
>>> index
<IntervalArray>
[(0, 1], (1, 2], (2, 3]]
Length: 3, dtype: interval[int64, right]
>>> index.set_closed('both')
<IntervalArray>
[[0, 1], [1, 2], [2, 3]]
Length: 3, dtype: interval[int64, both]
to_tuples(*args, **kwargs)

Return an ndarray of tuples of the form (left, right).

Parameters:

na_tuple (bool, default True) – If True, return NA as the tuple (nan, nan); if False, return NA as the value itself (nan).

Returns:

tuples

Return type:

ndarray
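
A short example with the default na_tuple=True:

>>> idx = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 2)])
>>> idx.to_tuples()
array([(0, 1), (1, 2)], dtype=object)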

class pandas.MultiIndex[source]

A multi-level, or hierarchical, index object for pandas objects.

Parameters:
  • levels (sequence of arrays) – The unique labels for each level.

  • codes (sequence of arrays) – Integers for each level designating which label at each location.

  • sortorder (optional int) – Level of sortedness (must be lexicographically sorted by that level).

  • names (optional sequence of objects) – Names for each of the index levels. (name is accepted for compat).

  • copy (bool, default False) – Copy the meta-data.

  • verify_integrity (bool, default True) – Check that the levels/codes are consistent and valid.

Return type:

MultiIndex

names
levels
codes
nlevels
levshape
dtypes
from_arrays()[source]
Parameters:

names (Sequence[Hashable] | Hashable | Literal[<no_default>]) –

Return type:

MultiIndex

from_tuples()[source]
Parameters:
Return type:

MultiIndex

from_product()[source]
Parameters:
Return type:

MultiIndex

from_frame()[source]
Parameters:

df (DataFrame) –

Return type:

MultiIndex

set_levels()[source]
Parameters:

verify_integrity (bool) –

Return type:

MultiIndex

set_codes()[source]
Parameters:

verify_integrity (bool) –

to_frame()[source]
Parameters:
  • index (bool) –

  • allow_duplicates (bool) –

Return type:

DataFrame

to_flat_index()[source]
Return type:

Index

sortlevel()[source]
Parameters:
  • level (IndexLabel) –

  • ascending (bool | list[bool]) –

  • sort_remaining (bool) –

Return type:

tuple[MultiIndex, npt.NDArray[np.intp]]

droplevel()
swaplevel()[source]
Return type:

MultiIndex

reorder_levels()[source]
Return type:

MultiIndex

remove_unused_levels()[source]
Return type:

MultiIndex

get_level_values()[source]
get_indexer()
get_loc()[source]
get_locs()[source]
get_loc_level()[source]
Parameters:
drop()[source]
Parameters:
Return type:

MultiIndex

See also

MultiIndex.from_arrays

Convert list of arrays to MultiIndex.

MultiIndex.from_product

Create a MultiIndex from the cartesian product of iterables.

MultiIndex.from_tuples

Convert list of tuples to a MultiIndex.

MultiIndex.from_frame

Make a MultiIndex from a DataFrame.

Index

The base pandas Index type.

Notes

See the user guide for more.

Examples

A new MultiIndex is typically constructed using one of the helper methods MultiIndex.from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). For example (using .from_arrays):

>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])

See further examples for how to construct a MultiIndex in the doc strings of the mentioned helper methods.

sortorder: int | None
classmethod from_arrays(arrays, sortorder=None, names=_NoDefault.no_default)[source]

Convert arrays to MultiIndex.

Parameters:
  • arrays (list / sequence of array-likes) – Each array-like gives one level’s value for each data point. len(arrays) is the number of levels.

  • sortorder (int or None) – Level of sortedness (must be lexicographically sorted by that level).

  • names (list / sequence of str, optional) – Names for the levels in the index.

Return type:

MultiIndex

See also

MultiIndex.from_tuples

Convert list of tuples to MultiIndex.

MultiIndex.from_product

Make a MultiIndex from cartesian product of iterables.

MultiIndex.from_frame

Make a MultiIndex from a DataFrame.

Examples

>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])
classmethod from_tuples(tuples, sortorder=None, names=None)[source]

Convert list of tuples to MultiIndex.

Parameters:
  • tuples (list / sequence of tuple-likes) – Each tuple is the index of one row/column.

  • sortorder (int or None) – Level of sortedness (must be lexicographically sorted by that level).

  • names (list / sequence of str, optional) – Names for the levels in the index.

Return type:

MultiIndex

See also

MultiIndex.from_arrays

Convert list of arrays to MultiIndex.

MultiIndex.from_product

Make a MultiIndex from cartesian product of iterables.

MultiIndex.from_frame

Make a MultiIndex from a DataFrame.

Examples

>>> tuples = [(1, 'red'), (1, 'blue'),
...           (2, 'red'), (2, 'blue')]
>>> pd.MultiIndex.from_tuples(tuples, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])
classmethod from_product(iterables, sortorder=None, names=_NoDefault.no_default)[source]

Make a MultiIndex from the cartesian product of multiple iterables.

Parameters:
  • iterables (list / sequence of iterables) – Each iterable has unique labels for each level of the index.

  • sortorder (int or None) – Level of sortedness (must be lexicographically sorted by that level).

  • names (list / sequence of str, optional) – Names for the levels in the index. If not explicitly provided, names will be inferred from the elements of iterables if an element has a name attribute.

Return type:

MultiIndex

See also

MultiIndex.from_arrays

Convert list of arrays to MultiIndex.

MultiIndex.from_tuples

Convert list of tuples to MultiIndex.

MultiIndex.from_frame

Make a MultiIndex from a DataFrame.

Examples

>>> numbers = [0, 1, 2]
>>> colors = ['green', 'purple']
>>> pd.MultiIndex.from_product([numbers, colors],
...                            names=['number', 'color'])
MultiIndex([(0,  'green'),
            (0, 'purple'),
            (1,  'green'),
            (1, 'purple'),
            (2,  'green'),
            (2, 'purple')],
           names=['number', 'color'])
classmethod from_frame(df, sortorder=None, names=None)[source]

Make a MultiIndex from a DataFrame.

Parameters:
  • df (DataFrame) – DataFrame to be converted to MultiIndex.

  • sortorder (int, optional) – Level of sortedness (must be lexicographically sorted by that level).

  • names (list-like, optional) – If no names are provided, use the column names, or tuple of column names if the columns is a MultiIndex. If a sequence, overwrite names with the given sequence.

Returns:

The MultiIndex representation of the given DataFrame.

Return type:

MultiIndex

See also

MultiIndex.from_arrays

Convert list of arrays to MultiIndex.

MultiIndex.from_tuples

Convert list of tuples to MultiIndex.

MultiIndex.from_product

Make a MultiIndex from cartesian product of iterables.

Examples

>>> df = pd.DataFrame([['HI', 'Temp'], ['HI', 'Precip'],
...                    ['NJ', 'Temp'], ['NJ', 'Precip']],
...                   columns=['a', 'b'])
>>> df
      a       b
0    HI    Temp
1    HI  Precip
2    NJ    Temp
3    NJ  Precip
>>> pd.MultiIndex.from_frame(df)
MultiIndex([('HI',   'Temp'),
            ('HI', 'Precip'),
            ('NJ',   'Temp'),
            ('NJ', 'Precip')],
           names=['a', 'b'])

Using explicit names, instead of the column names

>>> pd.MultiIndex.from_frame(df, names=['state', 'observation'])
MultiIndex([('HI',   'Temp'),
            ('HI', 'Precip'),
            ('NJ',   'Temp'),
            ('NJ', 'Precip')],
           names=['state', 'observation'])
property values: ndarray

Return an array representing the data in the Index.

Warning

We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.

Returns:

array

Return type:

numpy.ndarray or ExtensionArray

See also

Index.array

Reference to the underlying data.

Index.to_numpy

A NumPy array representing the underlying data.

property array

Raises a ValueError for MultiIndex because there’s no single array backing a MultiIndex.

Raises:

ValueError

dtypes

Return the dtypes as a Series for the underlying MultiIndex.
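
For example (unnamed levels are reported as level_0, level_1, ...):

>>> mi = pd.MultiIndex.from_arrays([[0, 1], ['a', 'b']])
>>> mi.dtypes
level_0     int64
level_1    object
dtype: object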

property size: int

Return the number of elements in the underlying data.
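
For example:

>>> mi = pd.MultiIndex.from_arrays([['a', 'b', 'c'], [1, 2, 3]])
>>> mi.size
3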

levels
set_levels(levels, *, level=None, verify_integrity=True)[source]

Set new levels on MultiIndex. Defaults to returning new index.

Parameters:
  • levels (sequence or list of sequence) – New level(s) to apply.

  • level (int, level name, or sequence of int/level names (default None)) – Level(s) to set (None for all levels).

  • verify_integrity (bool, default True) – If True, checks that levels and codes are compatible.

Return type:

MultiIndex

Examples

>>> idx = pd.MultiIndex.from_tuples(
...     [
...         (1, "one"),
...         (1, "two"),
...         (2, "one"),
...         (2, "two"),
...         (3, "one"),
...         (3, "two")
...     ],
...     names=["foo", "bar"]
... )
>>> idx
MultiIndex([(1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two'),
            (3, 'one'),
            (3, 'two')],
           names=['foo', 'bar'])
>>> idx.set_levels([['a', 'b', 'c'], [1, 2]])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           names=['foo', 'bar'])
>>> idx.set_levels(['a', 'b', 'c'], level=0)
MultiIndex([('a', 'one'),
            ('a', 'two'),
            ('b', 'one'),
            ('b', 'two'),
            ('c', 'one'),
            ('c', 'two')],
           names=['foo', 'bar'])
>>> idx.set_levels(['a', 'b'], level='bar')
MultiIndex([(1, 'a'),
            (1, 'b'),
            (2, 'a'),
            (2, 'b'),
            (3, 'a'),
            (3, 'b')],
           names=['foo', 'bar'])

If any of the levels passed to set_levels() exceeds the existing length, all of the values from that argument will be stored in the MultiIndex levels, though the values will be truncated in the MultiIndex output.

>>> idx.set_levels([['a', 'b', 'c'], [1, 2, 3, 4]], level=[0, 1])
MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2),
            ('c', 1),
            ('c', 2)],
           names=['foo', 'bar'])
>>> idx.set_levels([['a', 'b', 'c'], [1, 2, 3, 4]], level=[0, 1]).levels
FrozenList([['a', 'b', 'c'], [1, 2, 3, 4]])
property nlevels: int

Integer number of levels in this MultiIndex.

Examples

>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.nlevels
3
property levshape: Tuple[int, ...]

A tuple with the length of each level.

Examples

>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.levshape
(1, 1, 1)
property codes
set_codes(codes, *, level=None, verify_integrity=True)[source]

Set new codes on MultiIndex. Defaults to returning new index.

Parameters:
  • codes (sequence or list of sequence) – New codes to apply.

  • level (int, level name, or sequence of int/level names (default None)) – Level(s) to set (None for all levels).

  • verify_integrity (bool, default True) – If True, checks that levels and codes are compatible.

Returns:

A new MultiIndex with the updated codes.

Return type:

MultiIndex

Examples

>>> idx = pd.MultiIndex.from_tuples(
...     [(1, "one"), (1, "two"), (2, "one"), (2, "two")], names=["foo", "bar"]
... )
>>> idx
MultiIndex([(1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([[1, 0, 1, 0], [0, 0, 1, 1]])
MultiIndex([(2, 'one'),
            (1, 'one'),
            (2, 'two'),
            (1, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([1, 0, 1, 0], level=0)
MultiIndex([(2, 'one'),
            (1, 'two'),
            (2, 'one'),
            (1, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([0, 0, 1, 1], level='bar')
MultiIndex([(1, 'one'),
            (1, 'one'),
            (2, 'two'),
            (2, 'two')],
           names=['foo', 'bar'])
>>> idx.set_codes([[1, 0, 1, 0], [0, 0, 1, 1]], level=[0, 1])
MultiIndex([(2, 'one'),
            (1, 'one'),
            (2, 'two'),
            (1, 'two')],
           names=['foo', 'bar'])
copy(names=None, deep=False, name=None)[source]

Make a copy of this object.

Names can be passed and will be set on the new copy.

Parameters:
  • names (sequence, optional) –

  • deep (bool, default False) –

  • name (Label) – Kept for compatibility with 1-dimensional Index. Should not be used.

Return type:

MultiIndex

Notes

In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy. This could be potentially expensive on large MultiIndex objects.

Examples

>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi
MultiIndex([('a', 'b', 'c')],
           )
>>> mi.copy()
MultiIndex([('a', 'b', 'c')],
           )
view(cls=None)[source]

This is defined as a copy with the same identity.

dtype
memory_usage(deep=False)[source]

Memory usage of the values.

Parameters:

deep (bool, default False) – Introspect the data deeply, interrogate object dtypes for system-level memory consumption.

Return type:

bytes used

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of the array.

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False or if used on PyPy.

nbytes

Return the number of bytes in the underlying data.

format(name=None, formatter=None, na_rep=None, names=False, space=2, sparsify=None, adjoin=True)[source]

Render a string representation of the Index.

Parameters:
  • name (bool | None) –

  • formatter (Callable | None) –

  • na_rep (str | None) –

  • names (bool) –

  • space (int) –

  • adjoin (bool) –

Return type:

list

property names: FrozenList

Names of levels in MultiIndex.

Examples

>>> mi = pd.MultiIndex.from_arrays(
... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
            (2, 4, 6)],
           names=['x', 'y', 'z'])
>>> mi.names
FrozenList(['x', 'y', 'z'])
inferred_type
is_monotonic_increasing

Return a boolean indicating whether the values are equal or increasing.

is_monotonic_decreasing

Return a boolean indicating whether the values are equal or decreasing.
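
Both properties compare the index tuples lexicographically across levels, for example:

>>> mi = pd.MultiIndex.from_arrays([[1, 1, 2], ['a', 'b', 'a']])
>>> mi.is_monotonic_increasing
True
>>> mi.is_monotonic_decreasing
False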

duplicated(keep='first')[source]

Indicate duplicate index values.

Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the first, or all except the last occurrence of duplicates can be indicated.

Parameters:

keep ({'first', 'last', False}, default 'first') –

Determines which occurrences in a set of duplicates are marked as True.

  • ’first’ : Mark duplicates as True except for the first occurrence.

  • ’last’ : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Return type:

np.ndarray[bool]

See also

Series.duplicated

Equivalent method on pandas.Series.

DataFrame.duplicated

Equivalent method on pandas.DataFrame.

Index.drop_duplicates

Remove duplicate values from Index.

Examples

By default, for each set of duplicated values, the first occurrence is set to False and all others to True:

>>> idx = pd.Index(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> idx.duplicated()
array([False, False,  True, False,  True])

which is equivalent to

>>> idx.duplicated(keep='first')
array([False, False,  True, False,  True])

By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:

>>> idx.duplicated(keep='last')
array([ True, False,  True, False, False])

By setting keep to False, all duplicates are marked True:

>>> idx.duplicated(keep=False)
array([ True, False,  True, False,  True])
fillna(value=None, downcast=None)[source]

fillna is not implemented for MultiIndex

dropna(how='any')[source]

Return Index without NA/NaN values.

Parameters:

how ({'any', 'all'}, default 'any') – If the Index is a MultiIndex, drop the value when any or all levels are NaN.

Return type:

Index
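
A short example with a missing value in one level (using np.nan, as in the other examples here; note the level is cast to float):

>>> mi = pd.MultiIndex.from_arrays([[1, np.nan, 2], ['a', 'b', 'c']])
>>> mi.dropna()
MultiIndex([(1.0, 'a'),
            (2.0, 'c')],
           )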

get_level_values(level)[source]

Return vector of label values for requested level.

Length of returned vector is equal to the length of the index.

Parameters:

level (int or str) – level is either the integer position of the level in the MultiIndex, or the name of the level.

Returns:

Values is a level of this MultiIndex converted to a single Index (or subclass thereof).

Return type:

Index

Notes

If the level contains missing values, the result may be cast to float with missing values specified as NaN. This is because the level is converted to a regular Index.

Examples

Create a MultiIndex:

>>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
>>> mi.names = ['level_1', 'level_2']

Get level values by supplying level as either integer or name:

>>> mi.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object', name='level_1')
>>> mi.get_level_values('level_2')
Index(['d', 'e', 'f'], dtype='object', name='level_2')

If a level contains missing values, the return type of the level may be cast to float.

>>> pd.MultiIndex.from_arrays([[1, None, 2], [3, 4, 5]]).dtypes
level_0    int64
level_1    int64
dtype: object
>>> pd.MultiIndex.from_arrays([[1, None, 2], [3, 4, 5]]).get_level_values(0)
Index([1.0, nan, 2.0], dtype='float64')
unique(level=None)[source]

Return unique values in the index.

Unique values are returned in order of appearance, this does NOT sort.

Parameters:

level (int or hashable, optional) – Only return values from specified level (for MultiIndex). If int, gets the level by integer position, else by level name.

Return type:

Index

See also

unique

Numpy array of unique values in that column.

Series.unique

Return unique values of Series object.
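
For example, on a MultiIndex, unique tuples or the unique values of a single level can be requested:

>>> mi = pd.MultiIndex.from_arrays([[1, 1, 2], ['a', 'a', 'b']])
>>> mi.unique()
MultiIndex([(1, 'a'),
            (2, 'b')],
           )
>>> mi.unique(level=0)
Index([1, 2], dtype='int64')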

to_frame(index=True, name=_NoDefault.no_default, allow_duplicates=False)[source]

Create a DataFrame with the levels of the MultiIndex as columns.

Column ordering is determined by the DataFrame constructor with data as a dict.

Parameters:
  • index (bool, default True) – Set the index of the returned DataFrame as the original MultiIndex.

  • name (list / sequence of str, optional) – The passed names should substitute index level names.

  • allow_duplicates (bool, optional, default False) –

    Allow duplicate column labels to be created.

    New in version 1.5.0.

Return type:

DataFrame

See also

DataFrame

Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Examples

>>> mi = pd.MultiIndex.from_arrays([['a', 'b'], ['c', 'd']])
>>> mi
MultiIndex([('a', 'c'),
            ('b', 'd')],
           )
>>> df = mi.to_frame()
>>> df
     0  1
a c  a  c
b d  b  d
>>> df = mi.to_frame(index=False)
>>> df
   0  1
0  a  c
1  b  d
>>> df = mi.to_frame(name=['x', 'y'])
>>> df
     x  y
a c  a  c
b d  b  d
to_flat_index()[source]

Convert a MultiIndex to an Index of Tuples containing the level values.

Returns:

Index with the MultiIndex data represented in Tuples.

Return type:

pd.Index

See also

MultiIndex.from_tuples

Convert flat index back to MultiIndex.

Notes

This method will simply return the caller if called by anything other than a MultiIndex.

Examples

>>> index = pd.MultiIndex.from_product(
...     [['foo', 'bar'], ['baz', 'qux']],
...     names=['a', 'b'])
>>> index.to_flat_index()
Index([('foo', 'baz'), ('foo', 'qux'),
       ('bar', 'baz'), ('bar', 'qux')],
      dtype='object')
remove_unused_levels()[source]

Create new MultiIndex from current that removes unused levels.

Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex will have the same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.

Return type:

MultiIndex

Examples

>>> mi = pd.MultiIndex.from_product([range(2), list('ab')])
>>> mi
MultiIndex([(0, 'a'),
            (0, 'b'),
            (1, 'a'),
            (1, 'b')],
           )
>>> mi[2:]
MultiIndex([(1, 'a'),
            (1, 'b')],
           )

The 0 from the first level is not represented and can be removed

>>> mi2 = mi[2:].remove_unused_levels()
>>> mi2.levels
FrozenList([[1], ['a', 'b']])
take(indices, axis=0, allow_fill=True, fill_value=None, **kwargs)[source]

Return a new MultiIndex of the values selected by the indices.

For internal compatibility with numpy arrays.

Parameters:
  • indices (array-like) – Indices to be taken.

  • axis (int, optional) – The axis over which to select values, always 0.

  • allow_fill (bool, default True) –

  • fill_value (scalar, default None) – If allow_fill=True and fill_value is not None, indices specified by -1 are regarded as NA. If Index doesn’t hold NA, raise ValueError.

  • self (MultiIndex) –

Returns:

An index formed of elements at the given indices. Will be the same type as self, except for RangeIndex.

Return type:

Index

See also

numpy.ndarray.take

Return an array formed from the elements of a at the given indices.

append(other)[source]

Append a collection of Index options together.

Parameters:

other (Index or list/tuple of indices) –

Returns:

The combined index.

Return type:

Index

Examples

>>> mi = pd.MultiIndex.from_arrays([['a'], ['b']])
>>> mi
MultiIndex([('a', 'b')],
           )
>>> mi.append(mi)
MultiIndex([('a', 'b'), ('a', 'b')],
           )
argsort(*args, **kwargs)[source]

Return the integer indices that would sort the index.

Parameters:
  • *args – Passed to numpy.ndarray.argsort.

  • **kwargs – Passed to numpy.ndarray.argsort.

Returns:

Integer indices that would sort the index if used as an indexer.

Return type:

np.ndarray[np.intp]

See also

numpy.argsort

Similar method for NumPy arrays.

Index.sort_values

Return sorted copy of Index.

Examples

>>> idx = pd.Index(['b', 'a', 'd', 'c'])
>>> idx
Index(['b', 'a', 'd', 'c'], dtype='object')
>>> order = idx.argsort()
>>> order
array([1, 0, 3, 2])
>>> idx[order]
Index(['a', 'b', 'c', 'd'], dtype='object')
repeat(repeats, axis=None)[source]

Repeat elements of a MultiIndex.

Returns a new MultiIndex where each element of the current MultiIndex is repeated consecutively a given number of times.

Parameters:
  • repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty MultiIndex.

  • axis (None) – Must be None. Has no effect but is accepted for compatibility with numpy.

Returns:

Newly created MultiIndex with repeated elements.

Return type:

MultiIndex

See also

Series.repeat

Equivalent function for Series.

numpy.repeat

Similar method for numpy.ndarray.

Examples

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx
Index(['a', 'b', 'c'], dtype='object')
>>> idx.repeat(2)
Index(['a', 'a', 'b', 'b', 'c', 'c'], dtype='object')
>>> idx.repeat([1, 2, 3])
Index(['a', 'b', 'b', 'c', 'c', 'c'], dtype='object')
drop(codes, level=None, errors='raise')[source]

Make new MultiIndex with passed list of codes deleted.

Parameters:
  • codes (array-like) – Must be a list of tuples when level is not specified.

  • level (int or level name, default None) –

  • errors (str, default 'raise') –

Return type:

MultiIndex
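
For example, when level is not specified, codes must be a list of tuples:

>>> mi = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'c')])
>>> mi.drop([(2, 'b')])
MultiIndex([(1, 'a'),
            (3, 'c')],
           )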

swaplevel(i=-2, j=-1)[source]

Swap level i with level j.

Calling this method does not change the ordering of the values.

Parameters:
  • i (int, str, default -2) – First level of index to be swapped. Can pass level name as string. Type of parameters can be mixed.

  • j (int, str, default -1) – Second level of index to be swapped. Can pass level name as string. Type of parameters can be mixed.

Returns:

A new MultiIndex.

Return type:

MultiIndex

See also

Series.swaplevel

Swap levels i and j in a MultiIndex.

DataFrame.swaplevel

Swap levels i and j in a MultiIndex on a particular axis.

Examples

>>> mi = pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
...                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> mi
MultiIndex([('a', 'bb'),
            ('a', 'aa'),
            ('b', 'bb'),
            ('b', 'aa')],
           )
>>> mi.swaplevel(0, 1)
MultiIndex([('bb', 'a'),
            ('aa', 'a'),
            ('bb', 'b'),
            ('aa', 'b')],
           )
reorder_levels(order)[source]

Rearrange levels using input order. May not drop or duplicate levels.

Parameters:

order (list of int or list of str) – List representing new level order. Reference level by number (position) or by key (label).

Return type:

MultiIndex

Examples

>>> mi = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=['x', 'y'])
>>> mi
MultiIndex([(1, 3),
            (2, 4)],
           names=['x', 'y'])
>>> mi.reorder_levels(order=[1, 0])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])
>>> mi.reorder_levels(order=['y', 'x'])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])
sortlevel(level=0, ascending=True, sort_remaining=True)[source]

Sort MultiIndex at the requested level.

The result will respect the original ordering of the associated factor at that level.

Parameters:
  • level (list-like, int or str, default 0) – If a string is given, must be a name of the level. If list-like must be names or ints of levels.

  • ascending (bool, default True) – False to sort in descending order. Can also be a list to specify a directed ordering.

  • sort_remaining (bool, default True) – Sort by the remaining levels after level.

Returns:

  • sorted_index (pd.MultiIndex) – Resulting index.

  • indexer (np.ndarray[np.intp]) – Indices of output values in original index.

Return type:

tuple[MultiIndex, npt.NDArray[np.intp]]

Examples

>>> mi = pd.MultiIndex.from_arrays([[0, 0], [2, 1]])
>>> mi
MultiIndex([(0, 2),
            (0, 1)],
           )
>>> mi.sortlevel()
(MultiIndex([(0, 1),
            (0, 2)],
           ), array([1, 0]))
>>> mi.sortlevel(sort_remaining=False)
(MultiIndex([(0, 2),
            (0, 1)],
           ), array([0, 1]))
>>> mi.sortlevel(1)
(MultiIndex([(0, 1),
            (0, 2)],
           ), array([1, 0]))
>>> mi.sortlevel(1, ascending=False)
(MultiIndex([(0, 2),
            (0, 1)],
           ), array([0, 1]))
get_slice_bound(label, side)[source]

For an ordered MultiIndex, compute slice bound that corresponds to given label.

Returns the leftmost position (or one past the rightmost if side=='right') of the given label.

Parameters:
  • label (object or tuple of objects) –

  • side ({'left', 'right'}) –

Returns:

Index of label.

Return type:

int

Notes

This method only works if level 0 index of the MultiIndex is lexsorted.

Examples

>>> mi = pd.MultiIndex.from_arrays([list('abbc'), list('gefd')])

Get the locations from the leftmost ‘b’ in the first level until the end of the multiindex:

>>> mi.get_slice_bound('b', side="left")
1

Like above, but if you get the locations from the rightmost ‘b’ in the first level and ‘f’ in the second level:

>>> mi.get_slice_bound(('b','f'), side="right")
3

See also

MultiIndex.get_loc

Get location for a label or a tuple of labels.

MultiIndex.get_locs

Get location for a label/slice/list/mask or a sequence of such.

slice_locs(start=None, end=None, step=None)[source]

For an ordered MultiIndex, compute the slice locations for input labels.

The input labels can be tuples representing partial levels, e.g. for a MultiIndex with 3 levels, you can pass a single value (corresponding to the first level), or a 1-, 2-, or 3-tuple.

Parameters:
  • start (label or tuple, default None) – If None, defaults to the beginning

  • end (label or tuple) – If None, defaults to the end

  • step (int or None) – Slice step

Returns:

(start, end)

Return type:

(int, int)

Notes

This method only works if the MultiIndex is properly lexsorted. So, if only the first 2 levels of a 3-level MultiIndex are lexsorted, you can only pass two levels to .slice_locs.

Examples

>>> mi = pd.MultiIndex.from_arrays([list('abbd'), list('deff')],
...                                names=['A', 'B'])

Get the slice locations from the beginning of ‘b’ in the first level until the end of the multiindex:

>>> mi.slice_locs(start='b')
(1, 4)

Like above, but stop at the end of ‘b’ in the first level and ‘f’ in the second level:

>>> mi.slice_locs(start='b', end=('b', 'f'))
(1, 3)

See also

MultiIndex.get_loc

Get location for a label or a tuple of labels.

MultiIndex.get_locs

Get location for a label/slice/list/mask or a sequence of such.

get_loc(key)[source]

Get location for a label or a tuple of labels.

The location is returned as an integer/slice or boolean mask.

Parameters:

key (label or tuple of labels (one for each level)) –

Returns:

If the key is past the lexsort depth, the return may be a boolean mask array, otherwise it is always a slice or int.

Return type:

int, slice object or boolean mask

See also

Index.get_loc

The get_loc method for (single-level) index.

MultiIndex.slice_locs

Get slice location given start label(s) and end label(s).

MultiIndex.get_locs

Get location for a label/slice/list/mask or a sequence of such.

Notes

The key cannot be a slice, list of same-level labels, a boolean mask, or a sequence of such. If you want to use those, use MultiIndex.get_locs() instead.

Examples

>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])
>>> mi.get_loc('b')
slice(1, 3, None)
>>> mi.get_loc(('b', 'e'))
1
get_loc_level(key, level=0, drop_level=True)[source]

Get location and sliced index for requested label(s)/level(s).

Parameters:
  • key (label or sequence of labels) –

  • level (int/level name or list thereof, optional) –

  • drop_level (bool, default True) – If False, the resulting index will not drop any level.

Returns:

A 2-tuple where the elements :

Element 0: int, slice object or boolean array.

Element 1: The resulting sliced multiindex/index. If the key contains all levels, this will be None.

Return type:

tuple

See also

MultiIndex.get_loc

Get location for a label or a tuple of labels.

MultiIndex.get_locs

Get location for a label/slice/list/mask or a sequence of such.

Examples

>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')],
...                                names=['A', 'B'])
>>> mi.get_loc_level('b')
(slice(1, 3, None), Index(['e', 'f'], dtype='object', name='B'))
>>> mi.get_loc_level('e', level='B')
(array([False,  True, False]), Index(['b'], dtype='object', name='A'))
>>> mi.get_loc_level(['b', 'e'])
(1, None)
get_locs(seq)[source]

Get location for a sequence of labels.

Parameters:

seq (label, slice, list, mask or a sequence of such) – You should use one of the above for each level. If a level should not be used, set it to slice(None).

Returns:

NumPy array of integers suitable for passing to iloc.

Return type:

numpy.ndarray

See also

MultiIndex.get_loc

Get location for a label or a tuple of labels.

MultiIndex.slice_locs

Get slice location given start label(s) and end label(s).

Examples

>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])
>>> mi.get_locs('b')  
array([1, 2], dtype=int64)
>>> mi.get_locs([slice(None), ['e', 'f']])  
array([1, 2], dtype=int64)
>>> mi.get_locs([[True, False, True], slice('e', 'f')])  
array([2], dtype=int64)
truncate(before=None, after=None)[source]

Slice index between two labels / tuples, return new MultiIndex.

Parameters:
  • before (label or tuple, can be partial, default None) – If None, defaults to the start.

  • after (label or tuple, can be partial, default None) – If None, defaults to the end.

Returns:

The truncated MultiIndex.

Return type:

MultiIndex

Examples

>>> mi = pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['x', 'y', 'z']])
>>> mi
MultiIndex([('a', 'x'), ('b', 'y'), ('c', 'z')],
           )
>>> mi.truncate(before='a', after='b')
MultiIndex([('a', 'x'), ('b', 'y')],
           )
equals(other)[source]

Determines if two MultiIndex objects have the same labeling information (the levels themselves do not necessarily have to be the same).

See also

equal_levels

Parameters:

other (object) –

Return type:

bool
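
For example, the labeling matters but the constructor used to build the index does not:

>>> mi = pd.MultiIndex.from_arrays([[1, 2], ['a', 'b']])
>>> mi.equals(pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')]))
True
>>> mi.equals(pd.MultiIndex.from_tuples([(1, 'a'), (2, 'c')]))
False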

equal_levels(other)[source]

Return True if the levels of both MultiIndex objects are the same.

Parameters:

other (MultiIndex) –

Return type:

bool

astype(dtype, copy=True)[source]

Create an Index with values cast to dtypes.

The class of a new Index is determined by dtype. When conversion is impossible, a TypeError exception is raised.

Parameters:
  • dtype (numpy dtype or pandas type) – Note that any signed integer dtype is treated as 'int64', and any unsigned integer dtype is treated as 'uint64', regardless of the size.

  • copy (bool, default True) – By default, astype always returns a newly allocated object. If copy is set to False and internal requirements on dtype are satisfied, the original data is used to create a new Index or the original Index is returned.

Returns:

Index with values cast to specified dtype.

Return type:

Index
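
A small sketch on a flat Index (note that a MultiIndex itself generally only supports casting to object dtype):

>>> idx = pd.Index([1, 2, 3])
>>> idx.astype('float64')
Index([1.0, 2.0, 3.0], dtype='float64')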

putmask(mask, value)[source]

Return a new MultiIndex of the values set with the mask.

Parameters:
  • mask (array like) –

  • value (MultiIndex) – Must either be the same length as self or length one

Return type:

MultiIndex

insert(loc, item)[source]

Make a new MultiIndex, inserting a new item at the given location.

Parameters:
  • loc (int) –

  • item (tuple) – Must be same length as number of levels in the MultiIndex

Returns:

new_index

Return type:

Index
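
For example:

>>> mi = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')])
>>> mi.insert(1, (9, 'z'))
MultiIndex([(1, 'a'),
            (9, 'z'),
            (2, 'b')],
           )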

delete(loc)[source]

Make a new index with the passed location deleted.

Returns:

new_index

Return type:

MultiIndex
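
For example:

>>> mi = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b'), (3, 'c')])
>>> mi.delete(1)
MultiIndex([(1, 'a'),
            (3, 'c')],
           )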

isin(values, level=None)[source]

Return a boolean array where the index values are in values.

Compute boolean array of whether each index value is found in the passed set of values. The length of the returned boolean array matches the length of the index.

Parameters:
  • values (set or list-like) – Sought values.

  • level (str or int, optional) – Name or position of the index level to use (if the index is a MultiIndex).

Returns:

NumPy array of boolean values.

Return type:

np.ndarray[bool]

See also

Series.isin

Same for Series.

DataFrame.isin

Same method for DataFrames.

Notes

In the case of MultiIndex you must either specify values as a list-like object containing tuples that are the same length as the number of levels, or specify level. Otherwise it will raise a ValueError.

If level is specified:

  • if it is the name of one and only one index level, use that level;

  • otherwise it should be a number indicating level position.

Examples

>>> idx = pd.Index([1,2,3])
>>> idx
Index([1, 2, 3], dtype='int64')

Check whether each index value is in a list of values.

>>> idx.isin([1, 4])
array([ True, False, False])
>>> midx = pd.MultiIndex.from_arrays([[1,2,3],
...                                  ['red', 'blue', 'green']],
...                                  names=('number', 'color'))
>>> midx
MultiIndex([(1,   'red'),
            (2,  'blue'),
            (3, 'green')],
           names=['number', 'color'])

Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.

>>> midx.isin(['red', 'orange', 'yellow'], level='color')
array([ True, False, False])

To check across the levels of a MultiIndex, pass a list of tuples:

>>> midx.isin([(1, 'red'), (3, 'red')])
array([ True, False, False])

For a DatetimeIndex, string values in values are converted to Timestamps.

>>> dates = ['2000-03-11', '2000-03-12', '2000-03-13']
>>> dti = pd.to_datetime(dates)
>>> dti
DatetimeIndex(['2000-03-11', '2000-03-12', '2000-03-13'],
              dtype='datetime64[ns]', freq=None)
>>> dti.isin(['2000-03-11'])
array([ True, False, False])
rename(names, *, level=None, inplace=False)

Set Index or MultiIndex name.

Able to set new names partially and by level.

Parameters:
  • names (label or list of label or dict-like for MultiIndex) –

    Name(s) to set.

    Changed in version 1.3.0.

  • level (int, label or list of int or label, optional) –

    If the index is a MultiIndex and names is not dict-like, level(s) to set (None for all levels). Otherwise level must be None.

    Changed in version 1.3.0.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new Index or MultiIndex.

  • self (_IndexT) –

Returns:

The same type as the caller or None if inplace=True.

Return type:

Index or None

See also

Index.rename

Able to set new names without level.

Examples

>>> idx = pd.Index([1, 2, 3, 4])
>>> idx
Index([1, 2, 3, 4], dtype='int64')
>>> idx.set_names('quarter')
Index([1, 2, 3, 4], dtype='int64', name='quarter')
>>> idx = pd.MultiIndex.from_product([['python', 'cobra'],
...                                   [2018, 2019]])
>>> idx
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           )
>>> idx = idx.set_names(['kind', 'year'])
>>> idx.set_names('species', level=0)
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['species', 'year'])

When renaming levels with a dict, level cannot be passed.

>>> idx.set_names({'kind': 'snake'})
MultiIndex([('python', 2018),
            ('python', 2019),
            ( 'cobra', 2018),
            ( 'cobra', 2019)],
           names=['snake', 'year'])
class pandas.NamedAgg[source]

Helper for column specific aggregation with control over output column names.

Subclass of typing.NamedTuple.

Parameters:
  • column (Hashable) – Column label in the DataFrame to apply aggfunc.

  • aggfunc (function or str) – Function to apply to the provided column. If string, the name of a built-in pandas function.

Examples

>>> df = pd.DataFrame({"key": [1, 1, 2], "a": [-1, 0, 1], 1: [10, 11, 12]})
>>> agg_a = pd.NamedAgg(column="a", aggfunc="min")
>>> agg_1 = pd.NamedAgg(column=1, aggfunc=np.mean)
>>> df.groupby("key").agg(result_a=agg_a, result_1=agg_1)
     result_a  result_1
key
1          -1      10.5
2           1      12.0
column: Hashable

Alias for field number 0

aggfunc: str | Callable[[...], Any]

Alias for field number 1

class pandas.Period

Represents a period of time.

Parameters:
  • value (Period or str, default None) – The time period represented (e.g., ‘4Q2005’). This represents neither the start nor the end of the period, but rather the entire period itself.

  • freq (str, default None) – One of pandas period strings or corresponding objects. Accepted strings are listed in the offset alias section in the user docs.

  • ordinal (int, default None) – The period offset from the proleptic Gregorian epoch.

  • year (int, default None) – Year value of the period.

  • month (int, default 1) – Month value of the period.

  • quarter (int, default None) – Quarter value of the period.

  • day (int, default 1) – Day value of the period.

  • hour (int, default 0) – Hour value of the period.

  • minute (int, default 0) – Minute value of the period.

  • second (int, default 0) – Second value of the period.

Examples

>>> period = pd.Period('2012-1-1', freq='D')
>>> period
Period('2012-01-01', 'D')
class pandas.PeriodDtype[source]

An ExtensionDtype for Period data.

This is not an actual numpy dtype, but a duck type.

Parameters:

freq (str or DateOffset) – The frequency of this PeriodDtype.

freq
None()

Examples

>>> pd.PeriodDtype(freq='D')
period[D]
>>> pd.PeriodDtype(freq=pd.offsets.MonthEnd())
period[M]
type

alias of Period

kind: str = 'O'
str: str = '|O08'
base: dtype | ExtensionDtype | None = dtype('O')
num = 102
property freq

The frequency object of this PeriodDtype.

classmethod construct_from_string(string)[source]

Strict construction from a string; raises a TypeError if not possible.

Parameters:

string (str) –

Return type:

PeriodDtype
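
For example, the string must follow the 'period[freq]' format:

>>> pd.PeriodDtype.construct_from_string('period[D]')
period[D]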

property name: str

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

property na_value: NaTType

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

classmethod is_dtype(dtype)[source]

Return a boolean indicating whether the passed type is an actual dtype that we can match (via string or type).

Parameters:

dtype (object) –

Return type:

bool

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

class pandas.PeriodIndex[source]

Immutable ndarray holding ordinal values indicating regular periods in time.

Index keys are boxed to Period objects which carry the metadata (e.g., frequency information).

Parameters:
  • data (array-like (1d int np.ndarray or PeriodArray), optional) – Optional period-like data to construct index with.

  • copy (bool) – Make a copy of input ndarray.

  • freq (str or period object, optional) – One of pandas period strings or corresponding objects.

  • year (int, array, or Series, default None) –

  • month (int, array, or Series, default None) –

  • quarter (int, array, or Series, default None) –

  • day (int, array, or Series, default None) –

  • hour (int, array, or Series, default None) –

  • minute (int, array, or Series, default None) –

  • second (int, array, or Series, default None) –

  • dtype (str or PeriodDtype, default None) –

  • name (Hashable) –

Return type:

PeriodIndex

day
dayofweek
day_of_week
dayofyear
day_of_year
days_in_month
daysinmonth
end_time
freq
Type:

BaseOffset

freqstr
hour
is_leap_year
minute
month
quarter
qyear
second
start_time
week
weekday
weekofyear
year
asfreq()[source]
Parameters:

how (str) –

Return type:

PeriodIndex

strftime()
to_timestamp()[source]
Parameters:

how (str) –

Return type:

DatetimeIndex

See also

Index

The base pandas Index type.

Period

Represents a period of time.

DatetimeIndex

Index with datetime64 data.

TimedeltaIndex

Index of timedelta64 data.

period_range

Create a fixed-frequency PeriodIndex.

Examples

>>> idx = pd.PeriodIndex(year=[2000, 2002], quarter=[1, 3])
>>> idx
PeriodIndex(['2000Q1', '2002Q3'], dtype='period[Q-DEC]')
asfreq(freq=None, how='E')[source]

Convert the PeriodArray to the specified frequency freq.

Equivalent to applying pandas.Period.asfreq() with the given arguments to each Period in this PeriodArray.

Parameters:
  • freq (str) – A frequency.

  • how (str {'E', 'S'}, default 'E') –

    Whether the elements should be aligned to the end or start within a period.

    • ’E’, ‘END’, or ‘FINISH’ for end,

    • ’S’, ‘START’, or ‘BEGIN’ for start.

    January 31st (‘END’) vs. January 1st (‘START’) for example.

Returns:

The transformed PeriodArray with the new frequency.

Return type:

PeriodArray

See also

pandas.arrays.PeriodArray.asfreq

Convert each Period in a PeriodArray to the given frequency.

Period.asfreq

Convert a Period object to the given frequency.

Examples

>>> pidx = pd.period_range('2010-01-01', '2015-01-01', freq='A')
>>> pidx
PeriodIndex(['2010', '2011', '2012', '2013', '2014', '2015'],
            dtype='period[A-DEC]')
>>> pidx.asfreq('M')
PeriodIndex(['2010-12', '2011-12', '2012-12', '2013-12', '2014-12',
             '2015-12'], dtype='period[M]')
>>> pidx.asfreq('M', how='S')
PeriodIndex(['2010-01', '2011-01', '2012-01', '2013-01', '2014-01',
             '2015-01'], dtype='period[M]')
to_timestamp(freq=None, how='start')[source]

Cast to DatetimeArray/Index.

Parameters:
  • freq (str or DateOffset, optional) – Target frequency. The default is ‘D’ for week or longer, ‘S’ otherwise.

  • how ({'s', 'e', 'start', 'end'}) – Whether to use the start or end of the time period being converted.

Return type:

DatetimeArray/Index
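
A short sketch with monthly periods; the inferred freq of the result ('MS' here) may differ across pandas versions.

>>> idx = pd.PeriodIndex(['2023-01', '2023-02', '2023-03'], freq='M')
>>> idx.to_timestamp()
DatetimeIndex(['2023-01-01', '2023-02-01', '2023-03-01'],
              dtype='datetime64[ns]', freq='MS')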

property hour

The hour of the period.

property minute

The minute of the period.

property second

The second of the period.

property values: ndarray

Return an array representing the data in the Index.

Warning

We recommend using Index.array or Index.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.

Returns:

array

Return type:

numpy.ndarray or ExtensionArray

See also

Index.array

Reference to the underlying data.

Index.to_numpy

A NumPy array representing the underlying data.

asof_locs(where, mask)[source]

Return the locations (indices) of labels in the index.

As in the asof function, if a label in where is not present in the index, the latest index label up to that value is used; mask marks the positions where the index data is not NA.

Parameters:
  • where (Index) –

  • mask (npt.NDArray[np.bool_]) –

Return type:

np.ndarray

property is_full: bool

Returns True if this PeriodIndex is range-like in that all Periods between start and end are present, in order.
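
For example:

>>> pd.PeriodIndex(['2021', '2022', '2023'], freq='A').is_full
True
>>> pd.PeriodIndex(['2021', '2023'], freq='A').is_full
False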

property inferred_type: str

Return a string of the type inferred from the values.

get_loc(key)[source]

Get integer location for requested label.

Parameters:

key (Period, NaT, str, or datetime) – String or datetime key must be parsable as Period.

Returns:

loc

Return type:

int or ndarray[int64]

Raises:
  • KeyError – Key is not present in the index.

  • TypeError – If key is listlike or otherwise not hashable.
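
For example, string keys are parsed as Periods:

>>> idx = pd.PeriodIndex(['2023-01', '2023-02', '2023-03'], freq='M')
>>> idx.get_loc('2023-02')
1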

shift(periods=1, freq=None)[source]

Shift index by desired number of time frequency increments.

This method is for shifting the values of datetime-like indexes by a specified time increment a given number of times.

Parameters:
  • periods (int, default 1) – Number of periods (or increments) to shift by, can be positive or negative.

  • freq (pandas.DateOffset, pandas.Timedelta or string, optional) – Frequency increment to shift by. If None, the index is shifted by its own freq attribute. Offset aliases are valid strings, e.g., ‘D’, ‘W’, ‘M’ etc.

Returns:

Shifted index.

Return type:

PeriodIndex

See also

Index.shift

Shift values of Index.

PeriodIndex.shift

Shift values of PeriodIndex.
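
Examples

A brief sketch; with freq=None the index is shifted by its own frequency, here monthly:

>>> idx = pd.period_range('2023-01', periods=3, freq='M')
>>> idx.shift(1)
PeriodIndex(['2023-02', '2023-03', '2023-04'], dtype='period[M]')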

property day

The days of the period.

property day_of_week

The day of the week with Monday=0, Sunday=6.

property day_of_year

The ordinal day of the year.

property dayofweek

The day of the week with Monday=0, Sunday=6.

property dayofyear

The ordinal day of the year.

property days_in_month

The number of days in the month.

property daysinmonth

The number of days in the month.

property end_time

Get the Timestamp for the end of the period.

Return type:

Timestamp

See also

Period.start_time

Return the start Timestamp.

Period.dayofyear

Return the day of year.

Period.daysinmonth

Return the days in that month.

Period.dayofweek

Return the day of the week.

property is_leap_year

Logical indicating if the date belongs to a leap year.

property month

The month as January=1, December=12.

property quarter

The quarter of the date.

property qyear

The fiscal year the Period lies in, according to its starting quarter; this equals year when the fiscal and calendar years coincide.

property start_time

Get the Timestamp for the start of the period.

Return type:

Timestamp

See also

Period.end_time

Return the end Timestamp.

Period.dayofyear

Return the day of year.

Period.daysinmonth

Return the days in that month.

Period.dayofweek

Return the day of the week.

Examples

>>> period = pd.Period('2012-1-1', freq='D')
>>> period
Period('2012-01-01', 'D')
>>> period.start_time
Timestamp('2012-01-01 00:00:00')
>>> period.end_time
Timestamp('2012-01-01 23:59:59.999999999')
strftime(*args, **kwargs)

Convert to Index using specified date_format.

Return an Index of formatted strings specified by date_format, which supports the same string format as the python standard library. Details of the string format can be found in python string format doc.

Formats supported by the C strftime API but not by the python string format doc (such as “%R”, “%r”) are not officially supported and should preferably be replaced with their supported equivalents (such as “%H:%M”, “%I:%M:%S %p”).

Note that PeriodIndex supports additional directives, detailed in Period.strftime.

Parameters:

date_format (str) – Date format string (e.g. “%Y-%m-%d”).

Returns:

NumPy ndarray of formatted strings.

Return type:

ndarray[object]

See also

to_datetime

Convert the given argument to datetime.

DatetimeIndex.normalize

Return DatetimeIndex with times to midnight.

DatetimeIndex.round

Round the DatetimeIndex to the specified freq.

DatetimeIndex.floor

Floor the DatetimeIndex to the specified freq.

Timestamp.strftime

Format a single Timestamp.

Period.strftime

Format a single Period.

Examples

>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
...                     periods=3, freq='s')
>>> rng.strftime('%B %d, %Y, %r')
Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
       'March 10, 2018, 09:00:02 AM'],
      dtype='object')
property week

The week ordinal of the year.

property weekday

The day of the week with Monday=0, Sunday=6.

property weekofyear

The week ordinal of the year.

property year

The year of the period.

class pandas.RangeIndex[source]

Immutable Index implementing a monotonic integer range.

RangeIndex is a memory-saving special case of an Index limited to representing monotonic ranges with a 64-bit dtype. Using RangeIndex may in some instances improve computing speed.

This is the default index type used by DataFrame and Series when no explicit index is provided by the user.

Parameters:
  • start (int (default: 0), range, or other RangeIndex instance) – If int and “stop” is not given, interpreted as “stop” instead.

  • stop (int (default: 0)) –

  • step (int (default: 1)) –

  • dtype (np.int64) – Unused, accepted for homogeneity with other index types.

  • copy (bool, default False) – Unused, accepted for homogeneity with other index types.

  • name (object, optional) – Name to be stored in the index.

Return type:

RangeIndex

start
stop
step

See also

Index

The base pandas Index type.

classmethod from_range(data, name=None, dtype=None)[source]

Create RangeIndex from a range object.

Return type:

RangeIndex

Parameters:
  • data (range) –

  • dtype (Dtype | None) –
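
Examples

For illustration:

>>> pd.RangeIndex.from_range(range(5))
RangeIndex(start=0, stop=5, step=1)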

property start: int

The value of the start parameter (0 if this was not supplied).

property stop: int

The value of the stop parameter.

property step: int

The value of the step parameter (1 if this was not supplied).
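
Examples

A short sketch showing the three parameters round-tripping:

>>> idx = pd.RangeIndex(2, 10, 2)
>>> idx
RangeIndex(start=2, stop=10, step=2)
>>> (idx.start, idx.stop, idx.step)
(2, 10, 2)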

nbytes

Return the number of bytes in the underlying data.
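
Examples

Because only the start, stop, and step values are stored, the size does not grow with the length of the range (a sketch):

>>> pd.RangeIndex(100).nbytes == pd.RangeIndex(10_000_000).nbytes
True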

memory_usage(deep=False)[source]

Memory usage of the values.

Parameters:

deep (bool) – Introspect the data deeply, interrogating object dtypes for system-level memory consumption.

Return type:

int (number of bytes used)

Notes

Memory usage does not include memory consumed by elements that are not components of the array if deep=False.

See also

numpy.ndarray.nbytes
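
Examples

For a RangeIndex this is the same small constant as nbytes (a sketch):

>>> idx = pd.RangeIndex(1000)
>>> idx.memory_usage() == idx.nbytes
True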

property dtype: dtype

Return the dtype object of the underlying data.

property is_unique: bool

Return whether the index has unique values.

is_monotonic_increasing
is_monotonic_decreasing
property inferred_type: str

Return a string of the type inferred from the values.

get_loc(key)[source]

Get integer location, slice or boolean mask for requested label.

Parameters:

key (label) –

Return type:

int if unique index, slice if monotonic index, else mask

Examples

>>> unique_index = pd.Index(list('abc'))
>>> unique_index.get_loc('b')
1
>>> monotonic_index = pd.Index(list('abbc'))
>>> monotonic_index.get_loc('b')
slice(1, 3, None)
>>> non_monotonic_index = pd.Index(list('abcb'))
>>> non_monotonic_index.get_loc('b')
array([False,  True, False,  True])
tolist()[source]

Return a list of the values.

These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).

Return type:

list

See also

numpy.ndarray.tolist

Return the array as an a.ndim-levels deep nested list of Python scalars.
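
Examples

For illustration:

>>> pd.RangeIndex(3).tolist()
[0, 1, 2]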

copy(name=None, deep=False)[source]

Make a copy of this object.

Name is set on the new object.

Parameters:
  • name (Label, optional) – Set name for new object.

  • deep (bool, default False) –

Returns:

Index referring to the new object, which is a copy of this object.

Return type:

Index

Notes

In most cases, there should be no functional difference from using deep, but if deep is passed it will attempt to deepcopy.
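
Examples

A minimal sketch:

>>> idx = pd.RangeIndex(3, name='orig')
>>> idx.copy(name='copied').name
'copied'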

min(axis=None, skipna=True, *args, **kwargs)[source]

The minimum value of the RangeIndex.

Parameters:

skipna (bool) –

Return type:

int

max(axis=None, skipna=True, *args, **kwargs)[source]

The maximum value of the RangeIndex.

Parameters:

skipna (bool) –

Return type:

int
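
Examples

For illustration, with the even numbers below ten:

>>> idx = pd.RangeIndex(0, 10, 2)
>>> idx.min()
0
>>> idx.max()
8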

argsort(*args, **kwargs)[source]

Returns the indices that would sort the index and its underlying data.

Return type:

np.ndarray[np.intp]

See also

numpy.ndarray.argsort
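
Examples

A small sketch; the returned positions reorder the index into sorted order (repr as in recent pandas versions):

>>> idx = pd.Index([3, 1, 2])
>>> order = idx.argsort()
>>> order
array([1, 2, 0])
>>> idx[order]
Index([1, 2, 3], dtype='int64')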

factorize(sort=False, use_na_sentinel=True)[source]

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().

Parameters:
  • sort (bool, default False) – Sort uniques and shuffle codes to maintain the relationship.

  • use_na_sentinel (bool, default True) –

    If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not drop the NaN from the uniques of the values.

    New in version 1.5.0.

Returns:

  • codes (ndarray) – An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.

  • uniques (ndarray, Index, or Categorical) – The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.

    Note

    Even if there’s a missing value in values, uniques will not contain an entry for it.

Return type:

tuple[npt.NDArray[np.intp], RangeIndex]

See also

cut

Discretize continuous-valued array.

unique

Find the unique values in an array.

Notes

Reference the user guide for more examples.

Examples

These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1 and missing values are not included in uniques.

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']

Notice that 'b' is in uniques.categories, despite not being present in cat.values.

For all other pandas objects, an Index of the appropriate type is returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')

If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting use_na_sentinel=False.

>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: use_na_sentinel=True
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])
>>> codes, uniques = pd.factorize(values, use_na_sentinel=False)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
equals(other)[source]

Determines if two Index objects contain the same elements.

Parameters:

other (object) –

Return type:

bool
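
Examples

For illustration, only the elements are compared, not the container types:

>>> pd.RangeIndex(3).equals(pd.Index([0, 1, 2]))
True
>>> pd.RangeIndex(3).equals(pd.Index([0, 1, 3]))
False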

sort_values(return_indexer=False, ascending=True, na_position='last', key=None)[source]

Return a sorted copy of the index.

Return a sorted copy of the index, and optionally return the indices that sorted the index itself.

Parameters:
  • return_indexer (bool, default False) – Should the indices that would sort the index be returned.

  • ascending (bool, default True) – Should the index values be sorted in an ascending order.

  • na_position ({'first' or 'last'}, default 'last') –

    Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.

    New in version 1.2.0.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.

    New in version 1.1.0.

Returns:

  • sorted_index (pandas.Index) – Sorted copy of the index.

  • indexer (numpy.ndarray, optional) – The indices that the index itself was sorted by.

See also

Series.sort_values

Sort values of a Series.

DataFrame.sort_values

Sort values in a DataFrame.

Examples

>>> idx = pd.Index([10, 100, 1, 1000])
>>> idx
Index([10, 100, 1, 1000], dtype='int64')

Sort values in ascending order (default behavior).

>>> idx.sort_values()
Index([1, 10, 100, 1000], dtype='int64')

Sort values in descending order, and also get the indices idx was sorted by.

>>> idx.sort_values(ascending=False, return_indexer=True)
(Index([1000, 100, 10, 1], dtype='int64'), array([3, 1, 0, 2]))
symmetric_difference(other, result_name=None, sort=None)[source]

Compute the symmetric difference of two Index objects.

Parameters:
  • other (Index or array-like) –

  • result_name (str) –

  • sort (bool or None, default None) –

    Whether to sort the resulting index. By default, the values are attempted to be sorted, but any TypeError from incomparable elements is caught by pandas.

    • None : Attempt to sort the result, but catch any TypeErrors from comparing incomparable elements.

    • False : Do not sort the result.

    • True : Sort the result (which may raise TypeError).

Return type:

Index

Notes

symmetric_difference contains elements that appear in either idx1 or idx2 but not both. Equivalent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with duplicates dropped.

Examples

>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([2, 3, 4, 5])
>>> idx1.symmetric_difference(idx2)
Index([1, 5], dtype='int64')
delete(loc)[source]

Make new Index with passed location(-s) deleted.

Parameters:

loc (int or list of int) – Location of item(-s) which will be deleted. Use a list of locations to delete more than one value at the same time.

Returns:

Will be same type as self, except for RangeIndex.

Return type:

Index

See also

numpy.delete

Delete rows or columns from a NumPy array (ndarray).

Examples

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete(1)
Index(['a', 'c'], dtype='object')
>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.delete([0, 2])
Index(['b'], dtype='object')
insert(loc, item)[source]

Make new Index inserting new item at location.

Follows Python numpy.insert semantics for negative values.

Parameters:
  • loc (int) – Location at which to insert the item.

  • item (object) – The value to insert.

Return type:

Index
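
Examples

For illustration:

>>> idx = pd.Index(['a', 'b', 'c'])
>>> idx.insert(1, 'x')
Index(['a', 'x', 'b', 'c'], dtype='object')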

property size: int

Return the number of elements in the underlying data.

all(*args, **kwargs)[source]

Return whether all elements are Truthy.

Parameters:
  • *args – Required for compatibility with numpy.

  • **kwargs – Required for compatibility with numpy.

Returns:

A single element array-like may be converted to bool.

Return type:

bool or array-like (if axis is specified)

See also

Index.any

Return whether any element in an Index is True.

Series.any

Return whether any element in a Series is True.

Series.all

Return whether all elements in a Series are True.

Notes

Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.

Examples

True, because nonzero integers are considered True.

>>> pd.Index([1, 2, 3]).all()
True

False, because 0 is considered False.

>>> pd.Index([0, 1, 2]).all()
False
any(*args, **kwargs)[source]

Return whether any element is Truthy.

Parameters:
  • *args – Required for compatibility with numpy.

  • **kwargs – Required for compatibility with numpy.

Returns:

A single element array-like may be converted to bool.

Return type:

bool or array-like (if axis is specified)

See also

Index.all

Return whether all elements are True.

Series.all

Return whether all elements are True.

Notes

Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal to zero.

Examples

>>> index = pd.Index([0, 1, 2])
>>> index.any()
True
>>> index = pd.Index([0, 0, 0])
>>> index.any()
False
class pandas.Series[source]

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.

Parameters:
  • data (array-like, Iterable, dict, or scalar value) – Contains data stored in Series. If data is a dict, argument order is maintained.

  • index (array-like or Index (1d)) – Values must be hashable and have the same length as data. Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, …, n) if not provided. If data is dict-like and index is None, then the keys in the data are used as the index. If the index is not None, the resulting Series is reindexed with the index values.

  • dtype (str, numpy.dtype, or ExtensionDtype, optional) – Data type for the output Series. If not specified, this will be inferred from data. See the user guide for more usages.

  • name (Hashable, default None) – The name to give to the Series.

  • copy (bool, default False) – Copy input data. Only affects Series or 1d ndarray input. See examples.

  • fastpath (bool) –

Notes

Please reference the User Guide for more information.

Examples

Constructing Series from a dictionary with an Index specified

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['a', 'b', 'c'])
>>> ser
a   1
b   2
c   3
dtype: int64

The keys of the dictionary match with the Index values, hence the Index values have no effect.

>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> ser = pd.Series(data=d, index=['x', 'y', 'z'])
>>> ser
x   NaN
y   NaN
z   NaN
dtype: float64

Note that the Index is first built with the keys from the dictionary. After this the Series is reindexed with the given Index values, hence we get all NaN as a result.

Constructing Series from a list with copy=False.

>>> r = [1, 2]
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
[1, 2]
>>> ser
0    999
1      2
dtype: int64

Due to the input data type, the Series has a copy of the original data even though copy=False, so the data is unchanged.

Constructing Series from a 1d ndarray with copy=False.

>>> r = np.array([1, 2])
>>> ser = pd.Series(r, copy=False)
>>> ser.iloc[0] = 999
>>> r
array([999,   2])
>>> ser
0    999
1      2
dtype: int64

Due to the input data type, the Series has a view on the original data, so the data is changed as well.

property hasnans: bool

Return True if there are any NaNs.

Enables various performance speedups.

Return type:

bool
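
Examples

A brief sketch:

>>> pd.Series([1, 2, np.nan]).hasnans
True
>>> pd.Series([1, 2, 3]).hasnans
False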

div(other, level=None, fill_value=None, axis=0)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
rdiv(other, level=None, fill_value=None, axis=0)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.truediv

Element-wise Floating division, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
property dtype: DtypeObj

Return the dtype object of the underlying data.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.dtype
dtype('int64')
property dtypes: DtypeObj

Return the dtype object of the underlying data.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.dtypes
dtype('int64')
property name: Hashable

Return the name of the Series.

The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used whenever displaying the Series using the interpreter.

Returns:

The name of the Series, also the column name if part of a DataFrame.

Return type:

label (hashable object)

See also

Series.rename

Sets the Series name when given a scalar input.

Index.name

Corresponding Index property.

Examples

The Series name can be set initially when calling the constructor.

>>> s = pd.Series([1, 2, 3], dtype=np.int64, name='Numbers')
>>> s
0    1
1    2
2    3
Name: Numbers, dtype: int64
>>> s.name = "Integers"
>>> s
0    1
1    2
2    3
Name: Integers, dtype: int64

The name of a Series within a DataFrame is its column name.

>>> df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
...                   columns=["Odd Numbers", "Even Numbers"])
>>> df
   Odd Numbers  Even Numbers
0            1             2
1            3             4
2            5             6
>>> df["Even Numbers"].name
'Even Numbers'
property values

Return Series as ndarray or ndarray-like depending on the dtype.

Warning

We recommend using Series.array or Series.to_numpy(), depending on whether you need a reference to the underlying data or a NumPy array.

Return type:

numpy.ndarray or ndarray-like

See also

Series.array

Reference to the underlying data.

Series.to_numpy

A NumPy array representing the underlying data.

Examples

>>> pd.Series([1, 2, 3]).values
array([1, 2, 3])
>>> pd.Series(list('aabc')).values
array(['a', 'a', 'b', 'c'], dtype=object)
>>> pd.Series(list('aabc')).astype('category').values
['a', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']

Timezone aware datetime data is converted to UTC:

>>> pd.Series(pd.date_range('20130101', periods=3,
...                         tz='US/Eastern')).values
array(['2013-01-01T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000',
       '2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
property array: ExtensionArray

The ExtensionArray of the data backing this Series or Index.

Returns:

An ExtensionArray of the values stored within. For extension types, this is the actual array. For NumPy native types, this is a thin (no copy) wrapper around numpy.ndarray.

.array differs from .values, which may require converting the data to a different form.

Return type:

ExtensionArray

See also

Index.to_numpy

Similar method that always returns a NumPy array.

Series.to_numpy

Similar method that always returns a NumPy array.

Notes

This table lays out the different array types for each extension dtype within pandas.

dtype                 array type
category              Categorical
period                PeriodArray
interval              IntervalArray
IntegerNA             IntegerArray
string                StringArray
boolean               BooleanArray
datetime64[ns, tz]    DatetimeArray

For any 3rd-party extension types, the array type will be an ExtensionArray.

For all remaining dtypes .array will be an arrays.NumpyExtensionArray wrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then use Series.to_numpy() instead.

Examples

For regular NumPy types like int and float, a PandasArray is returned.

>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64

For extension types, like Categorical, the actual ExtensionArray is returned.

>>> ser = pd.Series(pd.Categorical(['a', 'b', 'a']))
>>> ser.array
['a', 'b', 'a']
Categories (2, object): ['a', 'b']
ravel(order='C')[source]

Return the flattened underlying data as an ndarray or ExtensionArray.

Returns:

Flattened data of the Series.

Return type:

numpy.ndarray or ExtensionArray

Parameters:

order (str) –

See also

numpy.ndarray.ravel

Return a flattened array.
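
Examples

For illustration:

>>> pd.Series([1, 2, 3]).ravel()
array([1, 2, 3])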

view(dtype=None)[source]

Create a new view of the Series.

This function will return a new Series with a view of the same underlying values in memory, optionally reinterpreted with a new data type. The new data type must preserve the same size in bytes as to not cause index misalignment.

Parameters:

dtype (data type) – Data type object or one of their string representations.

Returns:

A new Series object as a view of the same data in memory.

Return type:

Series

See also

numpy.ndarray.view

Equivalent numpy function to create a new view of the same data in memory.

Notes

Series are instantiated with dtype=float64 by default. While numpy.ndarray.view() will return a view with the same data type as the original array, Series.view() (without specified dtype) will try using float64 and may fail if the original data type size in bytes is not the same.

Examples

>>> s = pd.Series([-2, -1, 0, 1, 2], dtype='int8')
>>> s
0   -2
1   -1
2    0
3    1
4    2
dtype: int8

The 8 bit signed integer representation of -1 is 0b11111111, but the same bytes represent 255 if read as an 8 bit unsigned integer:

>>> us = s.view('uint8')
>>> us
0    254
1    255
2      0
3      1
4      2
dtype: uint8

The views share the same underlying values:

>>> us[0] = 128
>>> s
0   -128
1     -1
2      0
3      1
4      2
dtype: int8
property axes: list[pandas.core.indexes.base.Index]

Return a list of the row axis labels.
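
Examples

A Series has exactly one axis, its row index (a sketch):

>>> pd.Series([1, 2, 3]).axes
[RangeIndex(start=0, stop=3, step=1)]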

take(indices, axis=0, **kwargs)[source]

Return the elements in the given positional indices along an axis.

This means that we are not indexing according to actual values in the index attribute of the object. We are indexing according to the actual position of the element in the object.

Parameters:
  • indices (array-like) – An array of ints indicating which positions to take.

  • axis ({0 or 'index', 1 or 'columns', None}, default 0) – The axis on which to select elements. 0 means that we are selecting rows, 1 means that we are selecting columns. For Series this parameter is unused and defaults to 0.

  • **kwargs – For compatibility with numpy.take(). Has no effect on the output.

Returns:

An array-like containing the elements taken from the object.

Return type:

same type as caller

See also

DataFrame.loc

Select a subset of a DataFrame by labels.

DataFrame.iloc

Select a subset of a DataFrame by positions.

numpy.take

Take elements from an array along an axis.

Examples

>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df
     name   class  max_speed
0  falcon    bird      389.0
2  parrot    bird       24.0
3    lion  mammal       80.5
1  monkey  mammal        NaN

Take elements at positions 0 and 3 along the axis 0 (default).

Note how the actual indices selected (0 and 1) do not correspond to our selected indices 0 and 3. That’s because we are selecting the 0th and 3rd rows, not rows whose indices equal 0 and 3.

>>> df.take([0, 3])
     name   class  max_speed
0  falcon    bird      389.0
1  monkey  mammal        NaN

Take elements at indices 1 and 2 along the axis 1 (column selection).

>>> df.take([1, 2], axis=1)
    class  max_speed
0    bird      389.0
2    bird       24.0
3  mammal       80.5
1  mammal        NaN

We may take elements using negative integers for positive indices, starting from the end of the object, just like with Python lists.

>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5
repeat(repeats, axis=None)[source]

Repeat elements of a Series.

Returns a new Series where each element of the current Series is repeated consecutively a given number of times.

Parameters:
  • repeats (int or array of ints) – The number of repetitions for each element. This should be a non-negative integer. Repeating 0 times will return an empty Series.

  • axis (None) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

Newly created Series with repeated elements.

Return type:

Series

See also

Index.repeat

Equivalent function for Index.

numpy.repeat

Similar method for numpy.ndarray.

Examples

>>> s = pd.Series(['a', 'b', 'c'])
>>> s
0    a
1    b
2    c
dtype: object
>>> s.repeat(2)
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object
>>> s.repeat([1, 2, 3])
0    a
1    b
1    b
2    c
2    c
2    c
dtype: object
reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: Literal[False] = False, name: Hashable = _NoDefault.no_default, inplace: Literal[False] = False, allow_duplicates: bool = False) → DataFrame[source]
reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: Literal[True], name: Hashable = _NoDefault.no_default, inplace: Literal[False] = False, allow_duplicates: bool = False) → Series
reset_index(level: Hashable | Sequence[Hashable] = None, *, drop: bool = False, name: Hashable = _NoDefault.no_default, inplace: Literal[True], allow_duplicates: bool = False) → None

Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or when the index is meaningless and needs to be reset to the default before another operation.

Parameters:
  • level (int, str, tuple, or list, default optional) – For a Series with a MultiIndex, only remove the specified levels from the index. Removes all levels by default.

  • drop (bool, default False) – Just reset the index, without inserting it as a column in the new DataFrame.

  • name (object, optional) – The name to use for the column containing the original Series values. Uses self.name by default. This argument is ignored when drop is True.

  • inplace (bool, default False) – Modify the Series in place (do not create a new object).

  • allow_duplicates (bool, default False) –

    Allow duplicate column labels to be created.

    New in version 1.5.0.

Returns:

When drop is False (the default), a DataFrame is returned. The newly created columns will come first in the DataFrame, followed by the original Series values. When drop is True, a Series is returned. In either case, if inplace=True, no value is returned.

Return type:

Series or DataFrame or None

See also

DataFrame.reset_index

Analogous function for DataFrame.

Examples

>>> s = pd.Series([1, 2, 3, 4], name='foo',
...               index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))

Generate a DataFrame with default index.

>>> s.reset_index()
  idx  foo
0   a    1
1   b    2
2   c    3
3   d    4

To specify the name of the new column use name.

>>> s.reset_index(name='values')
  idx  values
0   a       1
1   b       2
2   c       3
3   d       4

To generate a new Series with the default set drop to True.

>>> s.reset_index(drop=True)
0    1
1    2
2    3
3    4
Name: foo, dtype: int64

The level parameter is interesting for Series with a multi-level index.

>>> arrays = [np.array(['bar', 'bar', 'baz', 'baz']),
...           np.array(['one', 'two', 'one', 'two'])]
>>> s2 = pd.Series(
...     range(4), name='foo',
...     index=pd.MultiIndex.from_arrays(arrays,
...                                     names=['a', 'b']))

To remove a specific level from the Index, use level.

>>> s2.reset_index(level='a')
       a  foo
b
one  bar    0
two  bar    1
one  baz    2
two  baz    3

If level is not set, all levels are removed from the Index.

>>> s2.reset_index()
     a    b  foo
0  bar  one    0
1  bar  two    1
2  baz  one    2
3  baz  two    3
to_string(buf: None = None, na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length=False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) → str[source]
to_string(buf: FilePath | WriteBuffer[str], na_rep: str = 'NaN', float_format: str | None = None, header: bool = True, index: bool = True, length=False, dtype=False, name=False, max_rows: int | None = None, min_rows: int | None = None) → None

Render a string representation of the Series.

Parameters:
  • buf (StringIO-like, optional) – Buffer to write to.

  • na_rep (str, optional) – String representation of NaN to use, default ‘NaN’.

  • float_format (one-parameter function, optional) – Formatter function to apply to columns’ elements if they are floats, default None.

  • header (bool, default True) – Add the Series header (index name).

  • index (bool, optional) – Add index (row) labels, default True.

  • length (bool, default False) – Add the Series length.

  • dtype (bool, default False) – Add the Series dtype.

  • name (bool, default False) – Add the Series name if not None.

  • max_rows (int, optional) – Maximum number of rows to show before truncating. If None, show all.

  • min_rows (int, optional) – The number of rows to display in a truncated repr (when number of rows is above max_rows).

Returns:

String representation of Series if buf=None, otherwise None.

Return type:

str or None
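
Examples

A minimal sketch; length, dtype, and name are omitted by default:

>>> s = pd.Series([1, 2, 3])
>>> print(s.to_string())
0    1
1    2
2    3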

to_markdown(buf=None, mode='wt', index=True, storage_options=None, **kwargs)[source]

Print Series in Markdown-friendly format.

Parameters:
  • buf (str, Path or StringIO-like, optional, default None) – Buffer to write to. If None, the output is returned as a string.

  • mode (str, optional) – Mode in which file is opened, “wt” by default.

  • index (bool, optional, default True) –

    Add index (row) labels.

    New in version 1.1.0.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • **kwargs

    These parameters will be passed to tabulate.

Returns:

Series in Markdown-friendly format.

Return type:

str

Notes

Requires the tabulate package.

Examples
>>> s = pd.Series(["elk", "pig", "dog", "quetzal"], name="animal")
>>> print(s.to_markdown())
|    | animal   |
|---:|:---------|
|  0 | elk      |
|  1 | pig      |
|  2 | dog      |
|  3 | quetzal  |

Output markdown with a tabulate option.

>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
|    | animal   |
+====+==========+
|  0 | elk      |
+----+----------+
|  1 | pig      |
+----+----------+
|  2 | dog      |
+----+----------+
|  3 | quetzal  |
+----+----------+
items()[source]

Lazily iterate over (index, value) tuples.

This method returns an iterable tuple (index, value). This is convenient if you want to create a lazy iterator.

Returns:

Iterable of tuples containing the (index, value) pairs from a Series.

Return type:

iterable

See also

DataFrame.items

Iterate over (column name, Series) pairs.

DataFrame.iterrows

Iterate over DataFrame rows as (index, Series) pairs.

Examples

>>> s = pd.Series(['A', 'B', 'C'])
>>> for index, value in s.items():
...     print(f"Index : {index}, Value : {value}")
Index : 0, Value : A
Index : 1, Value : B
Index : 2, Value : C
keys()[source]

Return alias for index.

Returns:

Index of the Series.

Return type:

Index
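
Examples

For illustration:

>>> s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> s.keys()
Index(['a', 'b', 'c'], dtype='object')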

to_dict(into=<class 'dict'>)[source]

Convert Series to {label -> value} dict or dict-like object.

Parameters:

into (class, default dict) – The collections.abc.Mapping subclass to use as the return object. Can be the actual class or an empty instance of the mapping type you want. If you want a collections.defaultdict, you must pass it initialized.

Returns:

Key-value representation of Series.

Return type:

collections.abc.Mapping

Examples

>>> s = pd.Series([1, 2, 3, 4])
>>> s.to_dict()
{0: 1, 1: 2, 2: 3, 3: 4}
>>> from collections import OrderedDict, defaultdict
>>> s.to_dict(OrderedDict)
OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
>>> dd = defaultdict(list)
>>> s.to_dict(dd)
defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})
to_frame(name=_NoDefault.no_default)[source]

Convert Series to DataFrame.

Parameters:

name (object, optional) – The passed name should substitute for the series name (if it has one).

Returns:

DataFrame representation of Series.

Return type:

DataFrame

Examples

>>> s = pd.Series(["a", "b", "c"],
...               name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)[source]

Group Series using a mapper or by a Series of columns.

A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.

Parameters:
  • by (mapping, function, label, pd.Grouper or list of such) –

    Used to determine the groups for the groupby. If by is a function, it’s called on each value of the object’s index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series’ values are first aligned; see .align() method). If a list or ndarray of length equal to the selected axis is passed (see the groupby user guide), the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Split along rows (0) or columns (1). For Series this parameter is unused and defaults to 0.

  • level (int, level name, or sequence of such, default None) – If the axis is a MultiIndex (hierarchical), group by a particular level or levels. Do not specify both by and level.

  • as_index (bool, default True) – For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.

  • sort (bool, default True) –

    Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.

    Changed in version 2.0.0: Specifying sort=False with an ordered categorical grouper will no longer sort the values.

  • group_keys (bool, default True) –

    When calling apply and the by argument produces a like-indexed (i.e. a transform) result, add group keys to index to identify pieces. By default group keys are not included when the result’s index (and column) labels match the inputs, and are included otherwise.

    Changed in version 1.5.0: Warns that group_keys will no longer be ignored when the result from apply is a like-indexed Series or DataFrame. Specify group_keys explicitly to include the group keys or not.

    Changed in version 2.0.0: group_keys now defaults to True.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

  • dropna (bool, default True) –

    If True, and if group keys contain NA values, NA values together with row/column will be dropped. If False, NA values will also be treated as the key in groups.

    New in version 1.1.0.

Returns:

Returns a groupby object that contains information about the groups.

Return type:

SeriesGroupBy

See also

resample

Convenience method for frequency conversion and resampling of time series.

Notes

See the user guide for more detailed usage and examples, including splitting an object into groups, iterating through groups, selecting a group, aggregation, and more.

Examples

>>> ser = pd.Series([390., 350., 30., 20.],
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'], name="Max Speed")
>>> ser
Falcon    390.0
Falcon    350.0
Parrot     30.0
Parrot     20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", "b"]).mean()
a    210.0
b    185.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()
Max Speed
False     25.0
True     370.0
Name: Max Speed, dtype: float64

Grouping by Indexes

We can groupby different levels of a hierarchical index using the level parameter:

>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed")
>>> ser
Animal  Type
Falcon  Captive    390.0
        Wild       350.0
Parrot  Captive     30.0
        Wild        20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()
Animal
Falcon    370.0
Parrot     25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level="Type").mean()
Type
Captive    210.0
Wild       185.0
Name: Max Speed, dtype: float64

We can also choose whether to include NA in the group keys by setting the dropna parameter; the default is True.

>>> ser = pd.Series([1, 2, 3, 3], index=["a", 'a', 'b', np.nan])
>>> ser.groupby(level=0).sum()
a    3
b    3
dtype: int64
>>> ser.groupby(level=0, dropna=False).sum()
a    3
b    3
NaN  3
dtype: int64
>>> arrays = ['Falcon', 'Falcon', 'Parrot', 'Parrot']
>>> ser = pd.Series([390., 350., 30., 20.], index=arrays, name="Max Speed")
>>> ser.groupby(["a", "b", "a", np.nan]).mean()
a    210.0
b    350.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", np.nan], dropna=False).mean()
a    210.0
b    350.0
NaN   20.0
Name: Max Speed, dtype: float64
count()[source]

Return number of non-NA/null observations in the Series.

Returns:

Number of non-null values in the Series.

Return type:

int or Series (if level specified)

See also

DataFrame.count

Count non-NA cells for each column or row.

Examples

>>> s = pd.Series([0.0, 1.0, np.nan])
>>> s.count()
2
mode(dropna=True)[source]

Return the mode(s) of the Series.

The mode is the value that appears most often. There can be multiple modes.

Always returns Series even if only one value is returned.

Parameters:

dropna (bool, default True) – Don’t consider counts of NaN/NaT.

Returns:

Modes of the Series in sorted order.

Return type:

Series
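
Examples

A brief sketch; a tie yields multiple modes, returned in sorted order:

>>> pd.Series([2, 4, 2, 2, 4]).mode()
0    2
dtype: int64
>>> pd.Series([1, 1, 2, 2]).mode()
0    1
1    2
dtype: int64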

unique()[source]

Return unique values of Series object.

Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.

Returns:

The unique values returned as a NumPy array. See Notes.

Return type:

ndarray or ExtensionArray

See also

Series.drop_duplicates

Return Series with duplicate values removed.

unique

Top-level unique method for any 1-d array-like object.

Index.unique

Return Index with unique values from an Index object.

Notes

Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new ExtensionArray of that type with just the unique values is returned. This includes

  • Categorical

  • Period

  • Datetime with Timezone

  • Datetime without Timezone

  • Timedelta

  • Interval

  • Sparse

  • IntegerNA

See Examples section.

Examples

>>> pd.Series([2, 1, 3, 3], name='A').unique()
array([2, 1, 3])
>>> pd.Series([pd.Timestamp('2016-01-01') for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00']
Length: 1, dtype: datetime64[ns]
>>> pd.Series([pd.Timestamp('2016-01-01', tz='US/Eastern')
...            for _ in range(3)]).unique()
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]

A Categorical will return categories in the order of appearance and with the same dtype.

>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.Series(pd.Categorical(list('baabc'), categories=list('abc'),
...                          ordered=True)).unique()
['b', 'a', 'c']
Categories (3, object): ['a' < 'b' < 'c']
drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: Literal[False] = False, ignore_index: bool = False) → Series[source]
drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: Literal[True], ignore_index: bool = False) → None
drop_duplicates(*, keep: Literal['first', 'last', False] = 'first', inplace: bool = False, ignore_index: bool = False) → Series | None

Return Series with duplicate values removed.

Parameters:
  • keep ({‘first’, ‘last’, False}, default ‘first’) –

    Method to handle dropping duplicates:

    • ‘first’ : Drop duplicates except for the first occurrence.

    • ‘last’ : Drop duplicates except for the last occurrence.

    • False : Drop all duplicates.

  • inplace (bool, default False) – If True, performs operation inplace and returns None.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 2.0.0.

Returns:

Series with duplicates dropped or None if inplace=True.

Return type:

Series or None

See also

Index.drop_duplicates

Equivalent method on Index.

DataFrame.drop_duplicates

Equivalent method on DataFrame.

Series.duplicated

Related method on Series, indicating duplicate Series values.

Series.unique

Return unique values as an array.

Examples

Generate a Series with duplicated entries.

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')
>>> s
0      lama
1       cow
2      lama
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

With the ‘keep’ parameter, the selection behaviour of duplicated values can be changed. The value ‘first’ keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.

>>> s.drop_duplicates()
0      lama
1       cow
3    beetle
5     hippo
Name: animal, dtype: object

The value ‘last’ for parameter ‘keep’ keeps the last occurrence for each set of duplicated entries.

>>> s.drop_duplicates(keep='last')
1       cow
3    beetle
4      lama
5     hippo
Name: animal, dtype: object

The value False for parameter ‘keep’ discards all sets of duplicated entries.

>>> s.drop_duplicates(keep=False)
1       cow
3    beetle
5     hippo
Name: animal, dtype: object
duplicated(keep='first')[source]

Indicate duplicate Series values.

Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except the first or all except the last occurrence of duplicates can be indicated.

Parameters:

keep ({'first', 'last', False}, default 'first') –

Method to handle dropping duplicates:

  • ‘first’ : Mark duplicates as True except for the first occurrence.

  • ‘last’ : Mark duplicates as True except for the last occurrence.

  • False : Mark all duplicates as True.

Returns:

Series indicating whether each value has occurred in the preceding values.

Return type:

Series[bool]

See also

Index.duplicated

Equivalent method on pandas.Index.

DataFrame.duplicated

Equivalent method on pandas.DataFrame.

Series.drop_duplicates

Remove duplicate values from Series.

Examples

By default, for each set of duplicated values, the first occurrence is set on False and all others on True:

>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])
>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool

which is equivalent to

>>> animals.duplicated(keep='first')
0    False
1    False
2     True
3    False
4     True
dtype: bool

By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True:

>>> animals.duplicated(keep='last')
0     True
1    False
2     True
3    False
4    False
dtype: bool

By setting keep on False, all duplicates are True:

>>> animals.duplicated(keep=False)
0     True
1    False
2     True
3    False
4     True
dtype: bool
idxmin(axis=0, skipna=True, *args, **kwargs)[source]

Return the row label of the minimum value.

If multiple values equal the minimum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the minimum value.

Return type:

Index

Raises:

ValueError – If the Series is empty.

See also

numpy.argmin

Return indices of the minimum values along the given axis.

DataFrame.idxmin

Return index of first occurrence of minimum over requested axis.

Series.idxmax

Return index label of the first occurrence of maximum of values.

Notes

This method is the Series version of ndarray.argmin. This method returns the label of the minimum, while ndarray.argmin returns the position. To get the position, use series.values.argmin().

Examples

>>> s = pd.Series(data=[1, None, 4, 1],
...               index=['A', 'B', 'C', 'D'])
>>> s
A    1.0
B    NaN
C    4.0
D    1.0
dtype: float64
>>> s.idxmin()
'A'

If skipna is False and there is an NA value in the data, the function returns nan.

>>> s.idxmin(skipna=False)
nan
idxmax(axis=0, skipna=True, *args, **kwargs)[source]

Return the row label of the maximum value.

If multiple values equal the maximum, the first row label with that value is returned.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • skipna (bool, default True) – Exclude NA/null values. If the entire Series is NA, the result will be NA.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Label of the maximum value.

Return type:

Index

Raises:

ValueError – If the Series is empty.

See also

numpy.argmax

Return indices of the maximum values along the given axis.

DataFrame.idxmax

Return index of first occurrence of maximum over requested axis.

Series.idxmin

Return index label of the first occurrence of minimum of values.

Notes

This method is the Series version of ndarray.argmax. This method returns the label of the maximum, while ndarray.argmax returns the position. To get the position, use series.values.argmax().

Examples

>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])
>>> s
A    1.0
B    NaN
C    4.0
D    3.0
E    4.0
dtype: float64
>>> s.idxmax()
'C'

If skipna is False and there is an NA value in the data, the function returns nan.

>>> s.idxmax(skipna=False)
nan
round(decimals=0, *args, **kwargs)[source]

Round each value in a Series to the given number of decimals.

Parameters:
  • decimals (int, default 0) – Number of decimal places to round to. If decimals is negative, it specifies the number of positions to the left of the decimal point.

  • *args – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional arguments and keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Rounded values of the Series.

Return type:

Series

See also

numpy.around

Round values of an np.array.

DataFrame.round

Round values of a DataFrame.

Examples

>>> s = pd.Series([0.1, 1.3, 2.7])
>>> s.round()
0    0.0
1    1.0
2    3.0
dtype: float64
quantile(q: float = 0.5, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → float[source]
quantile(q: Sequence[float] | ExtensionArray | ndarray | Index | Series, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → Series
quantile(q: float | Sequence[float] | ExtensionArray | ndarray | Index | Series = 0.5, interpolation: Literal['linear', 'lower', 'higher', 'midpoint', 'nearest'] = 'linear') → float | Series

Return value at the given quantile.

Parameters:
  • q (float or array-like, default 0.5 (50% quantile)) – The quantile(s) to compute, which can lie in range: 0 <= q <= 1.

  • interpolation ({'linear', 'lower', 'higher', 'midpoint', 'nearest'}) –

    This optional parameter specifies the interpolation method to use, when the desired quantile lies between two data points i and j:

    • linear: i + (j - i) * fraction, where fraction is the fractional part of the index surrounded by i and j.

    • lower: i.

    • higher: j.

    • nearest: i or j whichever is nearest.

    • midpoint: (i + j) / 2.

Returns:

If q is an array, a Series will be returned where the index is q and the values are the quantiles, otherwise a float will be returned.

Return type:

float or Series

See also

core.window.Rolling.quantile

Calculate the rolling quantile.

numpy.percentile

Returns the q-th percentile(s) of the array elements.

Examples

>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64
corr(other, method='pearson', min_periods=None)[source]

Compute correlation with other Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the correlation function is applied.

Parameters:
  • other (Series) – Series with which to compute the correlation.

  • method ({'pearson', 'kendall', 'spearman'} or callable) –

    Method used to compute correlation:

    • pearson : Standard correlation coefficient

    • kendall : Kendall Tau correlation coefficient

    • spearman : Spearman rank correlation

    • callable: Callable with input two 1d ndarrays and returning a float.

    Warning

    Note that the returned matrix from corr will have 1 along the diagonals and will be symmetric regardless of the callable’s behavior.

  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.

Returns:

Correlation with other.

Return type:

float

See also

DataFrame.corr

Compute pairwise correlation between columns.

DataFrame.corrwith

Compute pairwise correlation with another DataFrame or Series.

Notes

Pearson, Kendall and Spearman correlation are currently computed using pairwise complete observations.

Examples

>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3
cov(other, min_periods=None, ddof=1)[source]

Compute covariance with Series, excluding missing values.

The two Series objects are not required to be the same length and will be aligned internally before the covariance is calculated.

Parameters:
  • other (Series) – Series with which to compute the covariance.

  • min_periods (int, optional) – Minimum number of observations needed to have a valid result.

  • ddof (int, default 1) –

    Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

    New in version 1.1.0.

Returns:

Covariance between Series and other normalized by N-1 (unbiased estimator).

Return type:

float

See also

DataFrame.cov

Compute pairwise covariance of columns.

Examples

>>> s1 = pd.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = pd.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.01685762652715874
diff(periods=1)[source]

First discrete difference of element.

Calculates the difference of a Series element compared with another element in the Series (default is element in previous row).

Parameters:

periods (int, default 1) – Periods to shift for calculating difference, accepts negative values.

Returns:

First differences of the Series.

Return type:

Series

See also

Series.pct_change

Percent change over given number of periods.

Series.shift

Shift index by desired number of periods with an optional time freq.

DataFrame.diff

First discrete difference of object.

Notes

For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to current dtype in Series, however dtype of the result is always float64.

Examples

Difference with previous row

>>> s = pd.Series([1, 1, 2, 3, 5, 8])
>>> s.diff()
0    NaN
1    0.0
2    1.0
3    1.0
4    2.0
5    3.0
dtype: float64

Difference with 3rd previous row

>>> s.diff(periods=3)
0    NaN
1    NaN
2    NaN
3    2.0
4    4.0
5    6.0
dtype: float64

Difference with following row

>>> s.diff(periods=-1)
0    0.0
1   -1.0
2   -1.0
3   -2.0
4   -3.0
5    NaN
dtype: float64

Overflow in input dtype

>>> s = pd.Series([1, 0], dtype=np.uint8)
>>> s.diff()
0      NaN
1    255.0
dtype: float64
autocorr(lag=1)[source]

Compute the lag-N autocorrelation.

This method computes the Pearson correlation between the Series and its shifted self.

Parameters:

lag (int, default 1) – Number of lags to apply before performing autocorrelation.

Returns:

The Pearson correlation between self and self.shift(lag).

Return type:

float

See also

Series.corr

Compute the correlation between two Series.

Series.shift

Shift index by desired number of periods.

DataFrame.corr

Compute pairwise correlation of columns.

DataFrame.corrwith

Compute pairwise correlation between rows or columns of two DataFrame objects.

Notes

If the Pearson correlation is not well defined, ‘NaN’ is returned.

Examples

>>> s = pd.Series([0.25, 0.5, 0.2, -0.05])
>>> s.autocorr()  
0.10355...
>>> s.autocorr(lag=2)  
-0.99999...

If the Pearson correlation is not well defined, then ‘NaN’ is returned.

>>> s = pd.Series([1, 0, 0, 0])
>>> s.autocorr()
nan
dot(other)[source]

Compute the dot product between the Series and the columns of other.

This method computes the dot product between the Series and another one, or the Series and each column of a DataFrame, or the Series and each column of an array.

It can also be called using self @ other in Python >= 3.5.

Parameters:

other (Series, DataFrame or array-like) – The other object to compute the dot product with its columns.

Returns:

The dot product of the Series and other if other is a Series; a Series of dot products between the Series and each column of other if other is a DataFrame; or a numpy.ndarray of dot products between the Series and each column if other is a numpy array.

Return type:

scalar, Series or numpy.ndarray

See also

DataFrame.dot

Compute the matrix product with the DataFrame.

Series.mul

Multiplication of series and other, element-wise.

Notes

The Series and other have to share the same index if other is a Series or a DataFrame.

Examples

>>> s = pd.Series([0, 1, 2, 3])
>>> other = pd.Series([-1, 2, -3, 4])
>>> s.dot(other)
8
>>> s @ other
8
>>> df = pd.DataFrame([[0, 1], [-2, 3], [4, -5], [6, 7]])
>>> s.dot(df)
0    24
1    14
dtype: int64
>>> arr = np.array([[0, 1], [-2, 3], [4, -5], [6, 7]])
>>> s.dot(arr)
array([24, 14])
searchsorted(value, side='left', sorter=None)[source]

Find indices where elements should be inserted to maintain order.

Find the indices into a sorted Series self such that, if the corresponding elements in value were inserted before the indices, the order of self would be preserved.

Note

The Series must be monotonically sorted, otherwise wrong locations will likely be returned. Pandas does not check this for you.

Parameters:
  • value (array-like or scalar) – Values to insert into self.

  • side ({'left', 'right'}, optional) – If ‘left’, the index of the first suitable location found is given. If ‘right’, return the last such index. If there is no suitable index, return either 0 or N (where N is the length of self).

  • sorter (1-D array-like, optional) – Optional array of integer indices that sort self into ascending order. They are typically the result of np.argsort.

Returns:

A scalar or array of insertion points with the same shape as value.

Return type:

int or array of int

See also

sort_values

Sort by the values along either axis.

numpy.searchsorted

Similar method from NumPy.

Notes

Binary search is used to find the required insertion points.

Examples

>>> ser = pd.Series([1, 2, 3])
>>> ser
0    1
1    2
2    3
dtype: int64
>>> ser.searchsorted(4)
3
>>> ser.searchsorted([0, 4])
array([0, 3])
>>> ser.searchsorted([1, 3], side='left')
array([0, 2])
>>> ser.searchsorted([1, 3], side='right')
array([1, 3])
>>> ser = pd.Series(pd.to_datetime(['3/11/2000', '3/12/2000', '3/13/2000']))
>>> ser
0   2000-03-11
1   2000-03-12
2   2000-03-13
dtype: datetime64[ns]
>>> ser.searchsorted('3/14/2000')
3
>>> ser = pd.Categorical(
...     ['apple', 'bread', 'bread', 'cheese', 'milk'], ordered=True
... )
>>> ser
['apple', 'bread', 'bread', 'cheese', 'milk']
Categories (4, object): ['apple' < 'bread' < 'cheese' < 'milk']
>>> ser.searchsorted('bread')
1
>>> ser.searchsorted(['bread'], side='right')
array([3])

If the values are not monotonically sorted, wrong locations may be returned:

>>> ser = pd.Series([2, 1, 3])
>>> ser
0    2
1    1
2    3
dtype: int64
>>> ser.searchsorted(1)  
0  # wrong result, correct would be 1
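
When the Series is not sorted, the sorter argument can supply the sorting permutation; the returned position then indexes into the sorted order. A minimal sketch reusing the unsorted Series above (order is a hypothetical name):

>>> order = np.argsort(ser.to_numpy())
>>> ser.searchsorted(3, sorter=order)
2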
compare(other, align_axis=1, keep_shape=False, keep_equal=False, result_names=('self', 'other'))[source]

Compare to another Series and show the differences.

New in version 1.1.0.

Parameters:
  • other (Series) – Object to compare with.

  • align_axis ({0 or 'index', 1 or 'columns'}, default 1) –

    Determine which axis to align the comparison on.

    • 0, or ‘index’ : Resulting differences are stacked vertically with rows drawn alternately from self and other.

    • 1, or ‘columns’ : Resulting differences are aligned horizontally with columns drawn alternately from self and other.

  • keep_shape (bool, default False) – If true, all rows and columns are kept. Otherwise, only the ones with different values are kept.

  • keep_equal (bool, default False) – If true, the result keeps values that are equal. Otherwise, equal values are shown as NaNs.

  • result_names (tuple, default ('self', 'other')) –

    Set the names of self and other in the comparison result.

    New in version 1.5.0.

Returns:

If axis is 0 or ‘index’ the result will be a Series. The resulting index will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.

If axis is 1 or ‘columns’ the result will be a DataFrame. It will have two columns namely ‘self’ and ‘other’.

Return type:

Series or DataFrame

See also

DataFrame.compare

Compare with another DataFrame and show differences.

Notes

Matching NaNs will not appear as a difference.

Examples

>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])

Align the differences on columns

>>> s1.compare(s2)
  self other
1    b     a
3    d     b

Stack the differences on indices

>>> s1.compare(s2, align_axis=0)
1  self     b
   other    a
3  self     d
   other    b
dtype: object

Keep all original rows

>>> s1.compare(s2, keep_shape=True)
  self other
0  NaN   NaN
1    b     a
2  NaN   NaN
3    d     b
4  NaN   NaN

Keep all original rows and also all original values

>>> s1.compare(s2, keep_shape=True, keep_equal=True)
  self other
0    a     a
1    b     a
2    c     c
3    d     b
4    e     e
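
A minimal sketch of result_names, renaming the two columns in the horizontal layout:

>>> s1.compare(s2, result_names=('left', 'right'))
  left right
1    b     a
3    d     b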
combine(other, func, fill_value=None)[source]

Combine the Series with a Series or scalar according to func.

Combine the Series and other using func to perform elementwise selection for combined Series. fill_value is assumed when value is missing at some index from one of the two objects being combined.

Parameters:
  • other (Series or scalar) – The value(s) to be combined with the Series.

  • func (function) – Function that takes two scalars as inputs and returns an element.

  • fill_value (scalar, optional) – The value to assume when an index is missing from one Series or the other. The default specifies to use the appropriate NaN value for the underlying dtype of the Series.

Returns:

The result of combining the Series with the other object.

Return type:

Series

See also

Series.combine_first

Combine Series values, choosing the calling Series’ values first.

Examples

Consider two datasets, s1 and s2, containing the highest clocked speeds of different birds.

>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})
>>> s1
falcon    330.0
eagle     160.0
dtype: float64
>>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})
>>> s2
falcon    345.0
eagle     200.0
duck       30.0
dtype: float64

Now, to combine the two datasets and view the highest speeds of the birds across them

>>> s1.combine(s2, max)
duck        NaN
eagle     200.0
falcon    345.0
dtype: float64

In the previous example, the resulting value for duck is missing, because the maximum of NaN and a float is NaN. Setting fill_value=0 treats the missing side as 0, so the maximum returned is the value present in either dataset.

>>> s1.combine(s2, max, fill_value=0)
duck       30.0
eagle     200.0
falcon    345.0
dtype: float64
combine_first(other)[source]

Update null elements with value in the same location in ‘other’.

Combine two Series objects by filling null values in one Series with non-null values from the other Series. Result index will be the union of the two indexes.

Parameters:

other (Series) – The value(s) to be used for filling null values.

Returns:

The result of combining the provided Series with the other object.

Return type:

Series

See also

Series.combine

Perform element-wise operation on two Series using a given function.

Examples

>>> s1 = pd.Series([1, np.nan])
>>> s2 = pd.Series([3, 4, 5])
>>> s1.combine_first(s2)
0    1.0
1    4.0
2    5.0
dtype: float64

Null values still persist if the location of that null value does not exist in other

>>> s1 = pd.Series({'falcon': np.nan, 'eagle': 160.0})
>>> s2 = pd.Series({'eagle': 200.0, 'duck': 30.0})
>>> s1.combine_first(s2)
duck       30.0
eagle     160.0
falcon      NaN
dtype: float64
update(other)[source]

Modify Series in place using values from passed Series.

Uses non-NA values from passed Series to make updates. Aligns on index.

Parameters:

other (Series, or object coercible into Series) –

Return type:

None

Examples

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6]))
>>> s
0    4
1    5
2    6
dtype: int64
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.update(pd.Series(['d', 'e'], index=[0, 2]))
>>> s
0    d
1    b
2    e
dtype: object
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, 5, 6, 7, 8]))
>>> s
0    4
1    5
2    6
dtype: int64

If other contains NaNs the corresponding values are not updated in the original Series.

>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64

other can also be a non-Series object type that is coercible into a Series

>>> s = pd.Series([1, 2, 3])
>>> s.update([4, np.nan, 6])
>>> s
0    4
1    2
2    6
dtype: int64
>>> s = pd.Series([1, 2, 3])
>>> s.update({1: 9})
>>> s
0    1
1    9
2    3
dtype: int64
sort_values(*, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending: bool | int | Sequence[bool] | Sequence[int] = True, inplace: Literal[False] = False, kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) Series[source]
sort_values(*, axis: int | Literal['index', 'columns', 'rows'] = 0, ascending: bool | int | Sequence[bool] | Sequence[int] = True, inplace: Literal[True], kind: str = 'quicksort', na_position: str = 'last', ignore_index: bool = False, key: Callable[[Series], Series | ExtensionArray | ndarray | Index] | None = None) None

Sort by the values.

Sort a Series in ascending or descending order by some criterion.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • ascending (bool or list of bools, default True) – If True, sort values in ascending order, otherwise descending.

  • inplace (bool, default False) – If True, perform operation in-place.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.

  • na_position ({'first' or 'last'}, default 'last') – Argument ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

  • key (callable, optional) –

    If not None, apply the key function to the series values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect a Series and return an array-like.

    New in version 1.1.0.

Returns:

Series ordered by values or None if inplace=True.

Return type:

Series or None

See also

Series.sort_index

Sort by the Series indices.

DataFrame.sort_values

Sort DataFrame by the values along either axis.

DataFrame.sort_index

Sort DataFrame by indices.

Examples

>>> s = pd.Series([np.nan, 1, 3, 10, 5])
>>> s
0     NaN
1     1.0
2     3.0
3     10.0
4     5.0
dtype: float64

Sort values in ascending order (default behaviour)

>>> s.sort_values(ascending=True)
1     1.0
2     3.0
4     5.0
3    10.0
0     NaN
dtype: float64

Sort values in descending order

>>> s.sort_values(ascending=False)
3    10.0
4     5.0
2     3.0
1     1.0
0     NaN
dtype: float64

Sort values putting NAs first

>>> s.sort_values(na_position='first')
0     NaN
1     1.0
2     3.0
4     5.0
3    10.0
dtype: float64

Sort a series of strings

>>> s = pd.Series(['z', 'b', 'd', 'a', 'c'])
>>> s
0    z
1    b
2    d
3    a
4    c
dtype: object
>>> s.sort_values()
3    a
1    b
4    c
2    d
0    z
dtype: object

Sort using a key function. Your key function will be given the Series of values and should return an array-like.

>>> s = pd.Series(['a', 'B', 'c', 'D', 'e'])
>>> s.sort_values()
1    B
3    D
0    a
2    c
4    e
dtype: object
>>> s.sort_values(key=lambda x: x.str.lower())
0    a
1    B
2    c
3    D
4    e
dtype: object

NumPy ufuncs work well here. For example, we can sort by the sin of the value

>>> s = pd.Series([-4, -2, 0, 2, 4])
>>> s.sort_values(key=np.sin)
1   -2
4    4
2    0
0   -4
3    2
dtype: int64

More complicated user-defined functions can be used, as long as they expect a Series and return an array-like

>>> s.sort_values(key=lambda x: (np.tan(x.cumsum())))
0   -4
3    2
4    4
1   -2
2    0
dtype: int64
sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[True], kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) None[source]
sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: Literal[False] = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) Series
sort_index(*, axis: int | Literal['index', 'columns', 'rows'] = 0, level: Hashable | Sequence[Hashable] = None, ascending: bool | Sequence[bool] = True, inplace: bool = False, kind: Literal['quicksort', 'mergesort', 'heapsort', 'stable'] = 'quicksort', na_position: Literal['first', 'last'] = 'last', sort_remaining: bool = True, ignore_index: bool = False, key: Callable[[Index], Index | ExtensionArray | ndarray | Series] | None = None) Series | None

Sort Series by index labels.

Returns a new Series sorted by label if inplace argument is False, otherwise updates the original series and returns None.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • level (int, optional) – If not None, sort on values in specified index level(s).

  • ascending (bool or list-like of bools, default True) – Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

  • inplace (bool, default False) – If True, perform operation in-place.

  • kind ({'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See also numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.

  • na_position ({'first', 'last'}, default 'last') – If ‘first’ puts NaNs at the beginning, ‘last’ puts NaNs at the end. Not implemented for MultiIndex.

  • sort_remaining (bool, default True) – If True and sorting by level and index is multilevel, sort by other levels too (in order) after sorting by specified level.

  • ignore_index (bool, default False) – If True, the resulting axis will be labeled 0, 1, …, n - 1.

  • key (callable, optional) –

    If not None, apply the key function to the index values before sorting. This is similar to the key argument in the builtin sorted() function, with the notable difference that this key function should be vectorized. It should expect an Index and return an Index of the same shape.

    New in version 1.1.0.

Returns:

The original Series sorted by the labels or None if inplace=True.

Return type:

Series or None

See also

DataFrame.sort_index

Sort DataFrame by the index.

DataFrame.sort_values

Sort DataFrame by the value.

Series.sort_values

Sort Series by the value.

Examples

>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, 4])
>>> s.sort_index()
1    c
2    b
3    a
4    d
dtype: object

Sort Descending

>>> s.sort_index(ascending=False)
4    d
3    a
2    b
1    c
dtype: object

By default NaNs are put at the end, but use na_position to place them at the beginning

>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan])
>>> s.sort_index(na_position='first')
NaN     d
 1.0    c
 2.0    b
 3.0    a
dtype: object

Specify index level to sort

>>> arrays = [np.array(['qux', 'qux', 'foo', 'foo',
...                     'baz', 'baz', 'bar', 'bar']),
...           np.array(['two', 'one', 'two', 'one',
...                     'two', 'one', 'two', 'one'])]
>>> s = pd.Series([1, 2, 3, 4, 5, 6, 7, 8], index=arrays)
>>> s.sort_index(level=1)
bar  one    8
baz  one    6
foo  one    4
qux  one    2
bar  two    7
baz  two    5
foo  two    3
qux  two    1
dtype: int64

Does not sort by remaining levels when sorting by levels

>>> s.sort_index(level=1, sort_remaining=False)
qux  one    2
foo  one    4
baz  one    6
bar  one    8
qux  two    1
foo  two    3
baz  two    5
bar  two    7
dtype: int64

Apply a key function before sorting

>>> s = pd.Series([1, 2, 3, 4], index=['A', 'b', 'C', 'd'])
>>> s.sort_index(key=lambda x : x.str.lower())
A    1
b    2
C    3
d    4
dtype: int64
argsort(axis=0, kind='quicksort', order=None)[source]

Return the integer indices that would sort the Series values.

Override ndarray.argsort. Argsorts the value, omitting NA/null values, and places the result in the same locations as the non-NA values.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • kind ({'mergesort', 'quicksort', 'heapsort', 'stable'}, default 'quicksort') – Choice of sorting algorithm. See numpy.sort() for more information. ‘mergesort’ and ‘stable’ are the only stable algorithms.

  • order (None) – Has no effect but is accepted for compatibility with numpy.

Returns:

Positions of values within the sort order with -1 indicating nan values.

Return type:

Series[np.intp]

See also

numpy.ndarray.argsort

Returns the indices that would sort this array.
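
Examples

A minimal sketch with hypothetical data:

>>> s = pd.Series([3, 1, 2])
>>> s.argsort()
0    1
1    2
2    0
dtype: int64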

nlargest(n=5, keep='first')[source]

Return the largest n elements.

Parameters:
  • n (int, default 5) – Return this many descending sorted values.

  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a Series of n elements:

    • first : return the first n occurrences in order of appearance.

    • last : return the last n occurrences in reverse order of appearance.

    • all : keep all occurrences. This can result in a Series of size larger than n.

Returns:

The n largest values in the Series, sorted in decreasing order.

Return type:

Series

See also

Series.nsmallest

Get the n smallest elements.

Series.sort_values

Sort Series by values.

Series.head

Return the first n rows.

Notes

Faster than .sort_values(ascending=False).head(n) for small n relative to the size of the Series object.

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Malta         434000
Maldives      434000
Brunei        434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The n largest elements where n=5 by default.

>>> s.nlargest()
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64

The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.

>>> s.nlargest(3)
France    65000000
Italy     59000000
Malta       434000
dtype: int64

The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last with value 434000 based on the index order.

>>> s.nlargest(3, keep='last')
France      65000000
Italy       59000000
Brunei        434000
dtype: int64

The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements due to the three duplicates.

>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64
nsmallest(n=5, keep='first')[source]

Return the smallest n elements.

Parameters:
  • n (int, default 5) – Return this many ascending sorted values.

  • keep ({'first', 'last', 'all'}, default 'first') –

    When there are duplicate values that cannot all fit in a Series of n elements:

    • first : return the first n occurrences in order of appearance.

    • last : return the last n occurrences in reverse order of appearance.

    • all : keep all occurrences. This can result in a Series of size larger than n.

Returns:

The n smallest values in the Series, sorted in increasing order.

Return type:

Series

See also

Series.nlargest

Get the n largest elements.

Series.sort_values

Sort Series by values.

Series.head

Return the first n rows.

Notes

Faster than .sort_values().head(n) for small n relative to the size of the Series object.

Examples

>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Brunei": 434000, "Malta": 434000,
...                         "Maldives": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)
>>> s
Italy       59000000
France      65000000
Brunei        434000
Malta         434000
Maldives      434000
Iceland       337000
Nauru          11300
Tuvalu         11300
Anguilla       11300
Montserrat      5200
dtype: int64

The n smallest elements where n=5 by default.

>>> s.nsmallest()
Montserrat    5200
Nauru        11300
Tuvalu       11300
Anguilla     11300
Iceland     337000
dtype: int64

The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.

>>> s.nsmallest(3)
Montserrat   5200
Nauru       11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they are the last with value 11300 based on the index order.

>>> s.nsmallest(3, keep='last')
Montserrat   5200
Anguilla    11300
Tuvalu      11300
dtype: int64

The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four elements due to the three duplicates.

>>> s.nsmallest(3, keep='all')
Montserrat   5200
Nauru       11300
Tuvalu      11300
Anguilla    11300
dtype: int64
swaplevel(i=-2, j=-1, copy=None)[source]

Swap levels i and j in a MultiIndex.

Default is to swap the two innermost levels of the index.

Parameters:
  • i (int or str) – Levels of the indices to be swapped. Can pass level name as string.

  • j (int or str) – Levels of the indices to be swapped. Can pass level name as string.

  • copy (bool, default True) – Whether to copy underlying data.

Returns:

Series with levels swapped in MultiIndex.

Return type:

Series

Examples

>>> s = pd.Series(
...     ["A", "B", "A", "C"],
...     index=[
...         ["Final exam", "Final exam", "Coursework", "Coursework"],
...         ["History", "Geography", "History", "Geography"],
...         ["January", "February", "March", "April"],
...     ],
... )
>>> s
Final exam  History     January      A
            Geography   February     B
Coursework  History     March        A
            Geography   April        C
dtype: object

In the following example, we will swap the levels of the index. By not supplying any arguments for i and j, we swap the last and second-to-last levels.

>>> s.swaplevel()
Final exam  January     History         A
            February    Geography       B
Coursework  March       History         A
            April       Geography       C
dtype: object

By supplying one argument, we can choose which level to swap the last level with. For example, we can swap the first level with the last one as follows.

>>> s.swaplevel(0)
January     History     Final exam      A
February    Geography   Final exam      B
March       History     Coursework      A
April       Geography   Coursework      C
dtype: object

We can also define explicitly which levels we want to swap by supplying values for both i and j. Here, for example, we swap the first and second levels.

>>> s.swaplevel(0, 1)
History     Final exam  January         A
Geography   Final exam  February        B
History     Coursework  March           A
Geography   Coursework  April           C
dtype: object
reorder_levels(order)[source]

Rearrange index levels using input order.

May not drop or duplicate levels.

Parameters:

order (list of int representing new level order) – Reference level by number or key.

Return type:

type of caller (new object)
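
Examples

A minimal sketch with hypothetical data:

>>> s = pd.Series([1, 2], index=pd.MultiIndex.from_tuples(
...     [('a', 'x'), ('b', 'y')], names=['L0', 'L1']))
>>> s.reorder_levels([1, 0])
L1  L0
x   a     1
y   b     2
dtype: int64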

explode(ignore_index=False)[source]

Transform each element of a list-like to a row.

Parameters:

ignore_index (bool, default False) –

If True, the resulting index will be labeled 0, 1, …, n - 1.

New in version 1.1.0.

Returns:

Exploded lists to rows; index will be duplicated for these rows.

Return type:

Series

See also

Series.str.split

Split string values on specified separator.

Series.unstack

Unstack, a.k.a. pivot, Series with MultiIndex to produce DataFrame.

DataFrame.melt

Unpivot a DataFrame from wide format to long format.

DataFrame.explode

Explode a DataFrame from list-like columns to long format.

Notes

This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when exploding sets.

Reference the user guide for more examples.

Examples

>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])
>>> s
0    [1, 2, 3]
1          foo
2           []
3       [3, 4]
dtype: object
>>> s.explode()
0      1
0      2
0      3
1    foo
2    NaN
3      3
3      4
dtype: object
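
With ignore_index=True the resulting index is relabeled 0, 1, …, n - 1:

>>> s.explode(ignore_index=True)
0      1
1      2
2      3
3    foo
4    NaN
5      3
6      4
dtype: object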
unstack(level=-1, fill_value=None)[source]

Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.

Parameters:
  • level (int, str, or list of these, default last level) – Level(s) to unstack, can pass level name.

  • fill_value (scalar value, default None) – Value to use when replacing NaN values.

Returns:

Unstacked Series.

Return type:

DataFrame

Notes

Reference the user guide for more examples.

Examples

>>> s = pd.Series([1, 2, 3, 4],
...               index=pd.MultiIndex.from_product([['one', 'two'],
...                                                 ['a', 'b']]))
>>> s
one  a    1
     b    2
two  a    3
     b    4
dtype: int64
>>> s.unstack(level=-1)
     a  b
one  1  2
two  3  4
>>> s.unstack(level=0)
   one  two
a    1    3
b    2    4
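
fill_value replaces the NaN that would otherwise appear for missing combinations; a minimal sketch with a hypothetical Series s2:

>>> s2 = pd.Series([1, 2, 3], index=pd.MultiIndex.from_tuples(
...     [('one', 'a'), ('one', 'b'), ('two', 'a')]))
>>> s2.unstack(fill_value=0)
     a  b
one  1  2
two  3  0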
map(arg, na_action=None)[source]

Map values of Series according to an input mapping or function.

Used for substituting each value in a Series with another value that may be derived from a function, a dict or a Series.

Parameters:
  • arg (function, collections.abc.Mapping subclass or Series) – Mapping correspondence.

  • na_action ({None, 'ignore'}, default None) – If ‘ignore’, propagate NaN values, without passing them to the mapping correspondence.

Returns:

Same index as caller.

Return type:

Series

See also

Series.apply

For applying more complex functions on a Series.

DataFrame.apply

Apply a function row-/column-wise.

DataFrame.applymap

Apply a function elementwise on a whole DataFrame.

Notes

When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN. However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for default values), then this default is used rather than NaN.

Examples

>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s
0      cat
1      dog
2      NaN
3   rabbit
dtype: object

map accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless the dict has a default value (e.g. defaultdict):

>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0   kitten
1    puppy
2      NaN
3      NaN
dtype: object

It also accepts a function:

>>> s.map('I am a {}'.format)
0       I am a cat
1       I am a dog
2       I am a nan
3    I am a rabbit
dtype: object

To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can be used:

>>> s.map('I am a {}'.format, na_action='ignore')
0     I am a cat
1     I am a dog
2            NaN
3  I am a rabbit
dtype: object
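
A minimal sketch of the __missing__ behavior described in the Notes, using collections.defaultdict (which defines __missing__); note that the default is applied to NaN as well unless na_action='ignore' is given:

>>> from collections import defaultdict
>>> s.map(defaultdict(lambda: 'unknown', {'cat': 'kitten'}))
0    kitten
1   unknown
2   unknown
3   unknown
dtype: object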
aggregate(func=None, axis=0, *args, **kwargs)[source]

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when Series.agg is called with single function

  • Series : when DataFrame.agg is called with a single function

  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

Return type:

scalar, Series or DataFrame

See also

Series.apply

Invoke function on a Series.

Series.transform

Transform function producing a Series with like indexes.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods in the user guide for more details.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.agg('min')
1
>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64
agg(func=None, axis=0, *args, **kwargs)

Aggregate using one or more operations over the specified axis.

Parameters:
  • func (function, str, list or dict) –

    Function to use for aggregating the data. If a function, must either work when passed a Series or when passed to Series.apply.

    Accepted combinations are:

    • function

    • string function name

    • list of functions and/or function names, e.g. [np.sum, 'mean']

    • dict of axis labels -> functions, function names or list of such.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

The return can be:

  • scalar : when Series.agg is called with single function

  • Series : when DataFrame.agg is called with a single function

  • DataFrame : when DataFrame.agg is called with several functions

Return scalar, Series or DataFrame.

Return type:

scalar, Series or DataFrame

See also

Series.apply

Invoke function on a Series.

Series.transform

Transform function producing a Series with like indexes.

Notes

agg is an alias for aggregate. Use the alias.

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods in the user guide for more details.

A passed user-defined-function will be passed a Series for evaluation.

Examples

>>> s = pd.Series([1, 2, 3, 4])
>>> s
0    1
1    2
2    3
3    4
dtype: int64
>>> s.agg('min')
1
>>> s.agg(['min', 'max'])
min   1
max   4
dtype: int64
any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: None = ..., **kwargs) bool
any(*, axis: int | Literal['index', 'columns', 'rows'] = 0, bool_only: bool | None = None, skipna: bool = True, level: Hashable, **kwargs) Series | bool

Return whether any element is True, potentially over an axis.

Returns False unless there is at least one element within a series or along a DataFrame axis that is True or equivalent (e.g. non-zero or non-empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be False, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a Series is returned; otherwise, a scalar is returned.

Return type:

scalar or Series

See also

numpy.any

Numpy version of this method.

Series.any

Return whether any element is True.

Series.all

Return whether all elements are True.

DataFrame.any

Return whether any element is True over requested axis.

DataFrame.all

Return whether all elements are True over requested axis.

Examples

Series

For Series input, the output is a scalar indicating whether any element is True.

>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([], dtype="float64").any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True

DataFrame

Whether each column contains at least one True element (the default).

>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()
A     True
B     True
C    False
dtype: bool

Aggregating over the columns.

>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df
       A  B
0   True  1
1  False  2
>>> df.any(axis='columns')
0    True
1    True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0
>>> df.any(axis='columns')
0    True
1    False
dtype: bool

Aggregating over the entire DataFrame with axis=None.

>>> df.any(axis=None)
True

any for an empty DataFrame is an empty Series.

>>> pd.DataFrame([]).any()
Series([], dtype: bool)
transform(func, axis=0, *args, **kwargs)[source]

Call func on self producing a Series with the same axis shape as self.

Parameters:
  • func (function, str, list-like or dict-like) –

    Function to use for transforming the data. If a function, must either work when passed a Series or when passed to Series.apply. If func is both list-like and dict-like, dict-like behavior takes precedence.

    Accepted combinations are:

    • function

    • string function name

    • list-like of functions and/or function names, e.g. [np.exp, 'sqrt']

    • dict-like of axis labels -> functions, function names or list-like of such.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • *args – Positional arguments to pass to func.

  • **kwargs – Keyword arguments to pass to func.

Returns:

A Series that must have the same length as self.

Return type:

Series

Raises:

ValueError – If the returned Series has a different length than self.

See also

Series.agg

Only perform aggregating type operations.

Series.apply

Invoke function on a Series.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods in the user guide for more details.

Examples

>>> df = pd.DataFrame({'A': range(3), 'B': range(1, 4)})
>>> df
   A  B
0  0  1
1  1  2
2  2  3
>>> df.transform(lambda x: x + 1)
   A  B
0  1  2
1  2  3
2  3  4

Even though the resulting Series must have the same length as the input Series, it is possible to provide several input functions:

>>> s = pd.Series(range(3))
>>> s
0    0
1    1
2    2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
       sqrt        exp
0  0.000000   1.000000
1  1.000000   2.718282
2  1.414214   7.389056

You can call transform on a GroupBy object:

>>> df = pd.DataFrame({
...     "Date": [
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05",
...         "2015-05-08", "2015-05-07", "2015-05-06", "2015-05-05"],
...     "Data": [5, 8, 6, 1, 50, 100, 60, 120],
... })
>>> df
         Date  Data
0  2015-05-08     5
1  2015-05-07     8
2  2015-05-06     6
3  2015-05-05     1
4  2015-05-08    50
5  2015-05-07   100
6  2015-05-06    60
7  2015-05-05   120
>>> df.groupby('Date')['Data'].transform('sum')
0     55
1    108
2     66
3    121
4     55
5    108
6     66
7    121
Name: Data, dtype: int64
>>> df = pd.DataFrame({
...     "c": [1, 1, 1, 2, 2, 2, 2],
...     "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
   c type
0  1    m
1  1    n
2  1    o
3  2    m
4  2    m
5  2    n
6  2    n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
   c type size
0  1    m    3
1  1    n    3
2  1    o    3
3  2    m    4
4  2    m    4
5  2    n    4
6  2    n    4
apply(func, convert_dtype=True, args=(), **kwargs)[source]

Invoke function on values of Series.

Can be ufunc (a NumPy function that applies to the entire Series) or a Python function that only works on single values.

Parameters:
  • func (function) – Python function or NumPy ufunc to apply.

  • convert_dtype (bool, default True) – Try to find better dtype for elementwise function results. If False, leave as dtype=object. Note that the dtype is always preserved for some extension array dtypes, such as Categorical.

  • args (tuple) – Positional arguments passed to func after the series value.

  • **kwargs – Additional keyword arguments passed to func.

Returns:

If func returns a Series object the result will be a DataFrame.

Return type:

Series or DataFrame

See also

Series.map

For element-wise operations.

Series.agg

Only perform aggregating type operations.

Series.transform

Only perform transforming type operations.

Notes

Functions that mutate the passed object can produce unexpected behavior or errors and are not supported. See Mutating with User Defined Function (UDF) methods in the user guide for more details.

Examples

Create a series with typical summer temperatures for each city.

>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s
London      20
New York    21
Helsinki    12
dtype: int64

Square the values by defining a function and passing it as an argument to apply().

>>> def square(x):
...     return x ** 2
>>> s.apply(square)
London      400
New York    441
Helsinki    144
dtype: int64

Square the values by passing an anonymous function as an argument to apply().

>>> s.apply(lambda x: x ** 2)
London      400
New York    441
Helsinki    144
dtype: int64

Define a custom function that needs additional positional arguments and pass these additional arguments using the args keyword.

>>> def subtract_custom_value(x, custom_value):
...     return x - custom_value
>>> s.apply(subtract_custom_value, args=(5,))
London      15
New York    16
Helsinki     7
dtype: int64

Define a custom function that takes keyword arguments and pass these arguments to apply.

>>> def add_custom_values(x, **kwargs):
...     for month in kwargs:
...         x += kwargs[month]
...     return x
>>> s.apply(add_custom_values, june=30, july=20, august=25)
London      95
New York    96
Helsinki    87
dtype: int64

Use a function from the Numpy library.

>>> s.apply(np.log)
London      2.995732
New York    3.044522
Helsinki    2.484907
dtype: float64
align(other, join='outer', axis=None, level=None, copy=None, fill_value=None, method=None, limit=None, fill_axis=0, broadcast_axis=None)[source]

Align two objects on their axes with the specified join method.

Join method is specified for each axis Index.

Parameters:
  • other (DataFrame or Series) –

  • join ({'outer', 'inner', 'left', 'right'}, default 'outer') –

  • axis (allowed axis of the other object, default None) – Align on index (0), columns (1), or both (None).

  • level (int or level name, default None) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • copy (bool, default True) – Always returns new objects. If copy=False and no reindexing is required then original objects are returned.

  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed Series:

    • pad / ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use NEXT valid observation to fill gap.

  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

  • fill_axis ({0 or 'index'}, default 0) – Filling axis, method and limit.

  • broadcast_axis ({0 or 'index'}, default None) – Broadcast values along this axis, if aligning two objects of different dimensions.

Returns:

Aligned objects.

Return type:

tuple of (Series, type of other)

Examples

>>> df = pd.DataFrame(
...     [[1, 2, 3, 4], [6, 7, 8, 9]], columns=["D", "B", "E", "A"], index=[1, 2]
... )
>>> other = pd.DataFrame(
...     [[10, 20, 30, 40], [60, 70, 80, 90], [600, 700, 800, 900]],
...     columns=["A", "B", "C", "D"],
...     index=[2, 3, 4],
... )
>>> df
   D  B  E  A
1  1  2  3  4
2  6  7  8  9
>>> other
    A    B    C    D
2   10   20   30   40
3   60   70   80   90
4  600  700  800  900

Align on columns:

>>> left, right = df.align(other, join="outer", axis=1)
>>> left
   A  B   C  D  E
1  4  2 NaN  1  3
2  9  7 NaN  6  8
>>> right
    A    B    C    D   E
2   10   20   30   40 NaN
3   60   70   80   90 NaN
4  600  700  800  900 NaN

We can also align on the index:

>>> left, right = df.align(other, join="outer", axis=0)
>>> left
    D    B    E    A
1  1.0  2.0  3.0  4.0
2  6.0  7.0  8.0  9.0
3  NaN  NaN  NaN  NaN
4  NaN  NaN  NaN  NaN
>>> right
    A      B      C      D
1    NaN    NaN    NaN    NaN
2   10.0   20.0   30.0   40.0
3   60.0   70.0   80.0   90.0
4  600.0  700.0  800.0  900.0

Finally, the default axis=None will align on both index and columns:

>>> left, right = df.align(other, join="outer", axis=None)
>>> left
     A    B   C    D    E
1  4.0  2.0 NaN  1.0  3.0
2  9.0  7.0 NaN  6.0  8.0
3  NaN  NaN NaN  NaN  NaN
4  NaN  NaN NaN  NaN  NaN
>>> right
       A      B      C      D   E
1    NaN    NaN    NaN    NaN NaN
2   10.0   20.0   30.0   40.0 NaN
3   60.0   70.0   80.0   90.0 NaN
4  600.0  700.0  800.0  900.0 NaN
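
The DataFrame examples above carry over directly; a minimal Series-to-Series sketch with hypothetical s1 and s2:

>>> s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> s2 = pd.Series([10, 20], index=['b', 'd'])
>>> left, right = s1.align(s2, join='outer')
>>> left
a    1.0
b    2.0
c    3.0
d    NaN
dtype: float64
>>> right
a     NaN
b    10.0
c     NaN
d    20.0
dtype: float64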
rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: Literal[True], level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') None[source]
rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: Literal[False] = False, level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') Series
rename(index: Mapping[Any, Hashable] | Callable[[Any], Hashable] | Hashable | None = None, *, axis: int | Literal['index', 'columns', 'rows'] | None = None, copy: bool = True, inplace: bool = False, level: Hashable | None = None, errors: Literal['ignore', 'raise'] = 'ignore') Series | None

Alter Series index labels or name.

Function / dict values must be unique (1-to-1). Labels not contained in a dict / Series will be left as-is. Extra labels listed don’t throw an error.

Alternatively, change Series.name with a scalar value.

See the user guide for more.

Parameters:
  • index (scalar, hashable sequence, dict-like or function, optional) – Functions or dict-like are transformations to apply to the index. Scalar or hashable sequence-like will alter the Series.name attribute.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • copy (bool, default True) – Also copy underlying data.

  • inplace (bool, default False) – Whether to return a new Series. If True the value of copy is ignored.

  • level (int or level name, default None) – In case of MultiIndex, only rename labels in the specified level.

  • errors ({'ignore', 'raise'}, default 'ignore') – If ‘raise’, raise KeyError when a dict-like mapper or index contains labels that are not present in the index being transformed. If ‘ignore’, existing keys will be renamed and extra keys will be ignored.

Returns:

Series with index labels or name altered or None if inplace=True.

Return type:

Series or None

See also

DataFrame.rename

Corresponding DataFrame method.

Series.rename_axis

Set the name of the axis.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64
>>> s.rename({1: 3, 2: 5})  # mapping, changes labels
0    1
3    2
5    3
dtype: int64
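
A sketch of errors='raise' when a mapper key is missing from the index (the exact error message may vary across versions):

>>> s.rename({4: 9}, errors='raise')
Traceback (most recent call last):
...
KeyError: '[4] not found in axis'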
set_axis(labels, *, axis=0, copy=None)[source]

Assign desired index to given axis.

Indexes for row labels can be changed by assigning a list-like or Index.

Parameters:
  • labels (list-like, Index) – The values for the new index.

  • axis ({0 or 'index'}, default 0) – The axis to update. The value 0 identifies the rows. For Series this parameter is unused and defaults to 0.

  • copy (bool, default True) –

    Whether to make a copy of the underlying data.

    New in version 1.5.0.

Returns:

An object of type Series.

Return type:

Series

See also

Series.rename_axis

Alter the name of the index.

Examples

>>> s = pd.Series([1, 2, 3])
>>> s
0    1
1    2
2    3
dtype: int64
>>> s.set_axis(['a', 'b', 'c'], axis=0)
a    1
b    2
c    3
dtype: int64

reindex(index=None, *, axis=None, method=None, copy=None, level=None, fill_value=None, limit=None, tolerance=None)[source]

Conform Series to new index with optional filling logic.

Places NA/NaN in locations having no value in the previous index. A new object is produced unless the new index is equivalent to the current one and copy=False.

Parameters:
  • index (array-like, optional) – New labels for the index. Preferably an Index object to avoid duplicating data.

  • axis (int or str, optional) – Unused.

  • method ({None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}) –

    Method to use for filling holes in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series with a monotonically increasing/decreasing index.

    • None (default): don’t fill gaps

    • pad / ffill: Propagate last valid observation forward to next valid.

    • backfill / bfill: Use next valid observation to fill gap.

    • nearest: Use nearest valid observations to fill gap.

  • copy (bool, default True) – Return a new object, even if the passed indexes are the same.

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (scalar, default np.NaN) – Value to use for missing values. Defaults to NaN, but can be any “compatible” value.

  • limit (int, default None) – Maximum number of consecutive elements to forward or backward fill.

  • tolerance (optional) –

    Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance. A short sketch appears at the end of the Examples below.

    Tolerance may be a scalar value, which applies the same tolerance to all values, or list-like, which applies variable tolerance per element. List-like includes list, tuple, array, Series, and must be the same size as the index and its dtype must exactly match the index’s type.

Return type:

Series with changed index.

See also

DataFrame.set_index

Set row labels.

DataFrame.reset_index

Remove row labels or move them to new columns.

DataFrame.reindex_like

Change to same indices as other DataFrame.

Examples

DataFrame.reindex supports two calling conventions

  • (index=index_labels, columns=column_labels, ...)

  • (labels, axis={'index', 'columns'}, ...)

We highly recommend using keyword arguments to clarify your intent.

Create a dataframe with some fictional data.

>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                   'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> df
           http_status  response_time
Firefox            200           0.04
Chrome             200           0.02
Safari             404           0.07
IE10               404           0.08
Konqueror          301           1.00

Create a new index and reindex the dataframe. By default values in the new index that do not have corresponding records in the dataframe are assigned NaN.

>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02

We can fill in the missing values by passing a value to the keyword fill_value. Because the index is not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the NaN values.

>>> df.reindex(new_index, fill_value=0)
               http_status  response_time
Safari                 404           0.07
Iceweasel                0           0.00
Comodo Dragon            0           0.00
IE10                   404           0.08
Chrome                 200           0.02
>>> df.reindex(new_index, fill_value='missing')
              http_status response_time
Safari                404          0.07
Iceweasel         missing       missing
Comodo Dragon     missing       missing
IE10                  404          0.08
Chrome                200          0.02

We can also reindex the columns.

>>> df.reindex(columns=['http_status', 'user_agent'])
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

Or we can use “axis-style” keyword arguments

>>> df.reindex(['http_status', 'user_agent'], axis="columns")
           http_status  user_agent
Firefox            200         NaN
Chrome             200         NaN
Safari             404         NaN
IE10               404         NaN
Konqueror          301         NaN

To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically increasing index (for example, a sequence of dates).

>>> date_index = pd.date_range('1/1/2010', periods=6, freq='D')
>>> df2 = pd.DataFrame({"prices": [100, 101, np.nan, 100, 89, 88]},
...                    index=date_index)
>>> df2
            prices
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0

Suppose we decide to expand the dataframe to cover a wider date range.

>>> date_index2 = pd.date_range('12/29/2009', periods=10, freq='D')
>>> df2.reindex(date_index2)
            prices
2009-12-29     NaN
2009-12-30     NaN
2009-12-31     NaN
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by default filled with NaN. If desired, we can fill in the missing values using one of several options.

For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to the method keyword.

>>> df2.reindex(date_index2, method='bfill')
            prices
2009-12-29   100.0
2009-12-30   100.0
2009-12-31   100.0
2010-01-01   100.0
2010-01-02   101.0
2010-01-03     NaN
2010-01-04   100.0
2010-01-05    89.0
2010-01-06    88.0
2010-01-07     NaN

Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be filled by any of the value propagation schemes. This is because filling while reindexing does not look at dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN values present in the original dataframe, use the fillna() method.

See the user guide for more.
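
A minimal sketch of the tolerance parameter described above, with hypothetical data; the target label 7 is more than tolerance=1 away from every index label, so it is left as NaN:

>>> s = pd.Series([1, 2, 3], index=[0, 5, 10])
>>> s.reindex([0, 4, 7], method='nearest', tolerance=1)
0    1.0
4    2.0
7    NaN
dtype: float64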

rename_axis(mapper=_NoDefault.no_default, *, index=_NoDefault.no_default, axis=0, copy=True, inplace=False)[source]

Set the name of the axis for the index or columns.

Parameters:
  • mapper (scalar, list-like, optional) – Value to set the axis name attribute.

  • index (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies to DataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • columns (scalar, list-like, dict-like or function, optional) –

    A scalar, list-like, dict-like or function transformation to apply to that axis’ values. Note that the columns parameter is not allowed if the object is a Series. This parameter only applies to DataFrame objects.

    Use either mapper and axis to specify the axis to target with mapper, or index and/or columns.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The axis to rename. For Series this parameter is unused and defaults to 0.

  • copy (bool, default True) – Also copy underlying data.

  • inplace (bool, default False) – Modifies the object directly, instead of creating a new Series or DataFrame.

Returns:

The same type as the caller or None if inplace=True.

Return type:

Series, DataFrame, or None

See also

Series.rename

Alter Series index labels or name.

DataFrame.rename

Alter DataFrame index labels or name.

Index.rename

Set new names on index.

Notes

DataFrame.rename_axis supports two calling conventions

  • (index=index_mapper, columns=columns_mapper, ...)

  • (mapper, axis={'index', 'columns'}, ...)

The first calling convention will only modify the names of the index and/or the names of the Index object that is the columns. In this case, the parameter copy is ignored.

The second calling convention will modify the names of the corresponding index if mapper is a list or a scalar. However, if mapper is dict-like or a function, it will use the deprecated behavior of modifying the axis labels.

We highly recommend using keyword arguments to clarify your intent.

Examples

Series

>>> s = pd.Series(["dog", "cat", "monkey"])
>>> s
0       dog
1       cat
2    monkey
dtype: object
>>> s.rename_axis("animal")
animal
0    dog
1    cat
2    monkey
dtype: object

DataFrame

>>> df = pd.DataFrame({"num_legs": [4, 4, 2],
...                    "num_arms": [0, 0, 2]},
...                   ["dog", "cat", "monkey"])
>>> df
        num_legs  num_arms
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("animal")
>>> df
        num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2
>>> df = df.rename_axis("limbs", axis="columns")
>>> df
limbs   num_legs  num_arms
animal
dog            4         0
cat            4         0
monkey         2         2

MultiIndex

>>> df.index = pd.MultiIndex.from_product([['mammal'],
...                                        ['dog', 'cat', 'monkey']],
...                                       names=['type', 'name'])
>>> df
limbs          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
>>> df.rename_axis(index={'type': 'class'})
limbs          num_legs  num_arms
class  name
mammal dog            4         0
       cat            4         0
       monkey         2         2
>>> df.rename_axis(columns=str.upper)
LIMBS          num_legs  num_arms
type   name
mammal dog            4         0
       cat            4         0
       monkey         2         2
drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: Literal[True], errors: Literal['ignore', 'raise'] = 'raise') None[source]
drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: Literal[False] = False, errors: Literal['ignore', 'raise'] = 'raise') Series
drop(labels: Hashable | Sequence[Hashable] = None, *, axis: int | Literal['index', 'columns', 'rows'] = 0, index: Hashable | Sequence[Hashable] = None, columns: Hashable | Sequence[Hashable] = None, level: Hashable | None = None, inplace: bool = False, errors: Literal['ignore', 'raise'] = 'raise') Series | None

Return Series with specified index labels removed.

Remove elements of a Series based on specifying the index labels. When using a multi-index, labels on different levels can be removed by specifying the level.

Parameters:
  • labels (single label or list-like) – Index labels to drop.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • index (single label or list-like) – Redundant for application on Series, but ‘index’ can be used instead of ‘labels’.

  • columns (single label or list-like) – No change is made to the Series; use ‘index’ or ‘labels’ instead.

  • level (int or level name, optional) – For MultiIndex, level for which the labels will be removed.

  • inplace (bool, default False) – If True, do operation inplace and return None.

  • errors ({'ignore', 'raise'}, default 'raise') – If ‘ignore’, suppress error and only existing labels are dropped.

Returns:

Series with specified index labels removed or None if inplace=True.

Return type:

Series or None

Raises:

KeyError – If none of the labels are found in the index.

See also

Series.reindex

Return only specified index labels of Series.

Series.dropna

Return series without null values.

Series.drop_duplicates

Return Series with duplicate values removed.

DataFrame.drop

Drop specified labels from rows or columns.

Examples

>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s
A  0
B  1
C  2
dtype: int64

Drop labels B and C

>>> s.drop(labels=['B', 'C'])
A  0
dtype: int64
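
Dropping a label that is absent raises a KeyError by default; with errors='ignore', missing labels are skipped. A minimal sketch reusing s:

>>> s.drop(labels=['B', 'D'], errors='ignore')
A  0
C  2
dtype: int64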

Drop 2nd level label in MultiIndex Series

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      codes=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                             [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> s = pd.Series([45, 200, 1.2, 30, 250, 1.5, 320, 1, 0.3],
...               index=midx)
>>> s
lama    speed      45.0
        weight    200.0
        length      1.2
cow     speed      30.0
        weight    250.0
        length      1.5
falcon  speed     320.0
        weight      1.0
        length      0.3
dtype: float64
>>> s.drop(labels='weight', level=1)
lama    speed      45.0
        length      1.2
cow     speed      30.0
        length      1.5
falcon  speed     320.0
        length      0.3
dtype: float64
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[False] = False, limit: int | None = None, downcast: dict | None = None) Series[source]
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: Literal[True], limit: int | None = None, downcast: dict | None = None) None
fillna(value: Hashable | Mapping | Series | DataFrame = None, *, method: Literal['backfill', 'bfill', 'ffill', 'pad'] | None = None, axis: int | Literal['index', 'columns', 'rows'] | None = None, inplace: bool = False, limit: int | None = None, downcast: dict | None = None) Series | None

Fill NA/NaN values using the specified method.

Parameters:
  • value (scalar, dict, Series, or DataFrame) – Value to use to fill holes (e.g. 0), alternately a dict/Series/DataFrame of values specifying which value to use for each index (for a Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not be filled. This value cannot be a list.

  • method ({'backfill', 'bfill', 'ffill', None}, default None) –

    Method to use for filling holes in reindexed Series:

    • ffill: propagate last valid observation forward to next valid.

    • backfill / bfill: use next valid observation to fill gap.

  • axis ({0 or 'index'}) – Axis along which to fill missing values. For Series this parameter is unused and defaults to 0.

  • inplace (bool, default False) – If True, fill in-place. Note: this will modify any other views on this object (e.g., a no-copy slice for a column in a DataFrame).

  • limit (int, default None) – If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap with more than this number of consecutive NaNs, it will only be partially filled. If method is not specified, this is the maximum number of entries along the entire axis where NaNs will be filled. Must be greater than 0 if not None.

  • downcast (dict, default is None) – A dict of item->dtype of what to downcast if possible, or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to int64 if possible).

Returns:

Object with missing values filled or None if inplace=True.

Return type:

Series or None

See also

interpolate

Fill NaN values using interpolation.

reindex

Conform object to new index.

asfreq

Convert TimeSeries to specified frequency.

Examples

>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, np.nan],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list("ABCD"))
>>> df
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  NaN NaN  NaN
3  NaN  3.0 NaN  4.0

Replace all NaN elements with 0s.

>>> df.fillna(0)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  0.0
3  0.0  3.0  0.0  4.0

We can also propagate non-null values forward or backward.

>>> df.fillna(method="ffill")
     A    B   C    D
0  NaN  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  3.0  4.0 NaN  1.0
3  3.0  3.0 NaN  4.0
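
Or backward with bfill, where each NaN takes the next valid observation below it; a minimal sketch (column C stays NaN because it has no valid values at all):

>>> df.fillna(method="bfill")
     A    B   C    D
0  3.0  2.0 NaN  0.0
1  3.0  4.0 NaN  1.0
2  NaN  3.0 NaN  4.0
3  NaN  3.0 NaN  4.0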

Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.

>>> values = {"A": 0, "B": 1, "C": 2, "D": 3}
>>> df.fillna(value=values)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  2.0  1.0
2  0.0  1.0  2.0  3.0
3  0.0  3.0  2.0  4.0

Only replace the first NaN element.

>>> df.fillna(value=values, limit=1)
     A    B    C    D
0  0.0  2.0  2.0  0.0
1  3.0  4.0  NaN  1.0
2  NaN  1.0  NaN  3.0
3  NaN  3.0  NaN  4.0

When filling using a DataFrame, replacement happens along the same column names and same indices

>>> df2 = pd.DataFrame(np.zeros((4, 4)), columns=list("ABCE"))
>>> df.fillna(df2)
     A    B    C    D
0  0.0  2.0  0.0  0.0
1  3.0  4.0  0.0  1.0
2  0.0  0.0  0.0  NaN
3  0.0  3.0  0.0  4.0

Note that column D is not affected since it is not present in df2.

pop(item)[source]

Return item and drop it from the series. Raise a KeyError if not found.

Parameters:

item (label) – Index of the element that needs to be removed.

Return type:

Value that is popped from series.

Examples

>>> ser = pd.Series([1, 2, 3])
>>> ser.pop(0)
1
>>> ser
1    2
2    3
dtype: int64
replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[False] = False, limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) Series[source]
replace(to_replace=None, value=_NoDefault.no_default, *, inplace: ~typing.Literal[True], limit: int | None = None, regex: bool = False, method: ~typing.Literal['pad', 'ffill', 'bfill'] | ~typing.Literal[<no_default>] = _NoDefault.no_default) None

Replace values given in to_replace with value.

Values of the Series are replaced with other values dynamically.

This differs from updating with .loc or .iloc, which require you to specify a location to update with some value.

Parameters:
  • to_replace (str, regex, list, dict, Series, int, float, or None) –

    How to find the values that will be replaced.

    • numeric, str or regex:

      • numeric: numeric values equal to to_replace will be replaced with value

      • str: string exactly matching to_replace will be replaced with value

      • regex: regexs matching to_replace will be replaced with value

    • list of str, regex, or numeric:

      • First, if to_replace and value are both lists, they must be the same length.

      • Second, if regex=True then all of the strings in both lists will be interpreted as regexs otherwise they will match directly. This doesn’t matter much for value since there are only a few possible substitution regexes you can use.

      • str, regex and numeric rules apply as above.

    • dict:

      • Dicts can be used to specify different replacement values for different existing values. For example, {'a': 'b', 'y': 'z'} replaces the value ‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way, the optional value parameter should not be given.

      • For a DataFrame a dict can specify that different values should be replaced in different columns. For example, {'a': 1, 'b': 'z'} looks for the value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with whatever is specified in value. The value parameter should not be None in this case. You can treat this as a special case of passing two lists except that you are specifying the column to search in.

      • For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN. The optional value parameter should not be specified to use a nested dict in this way. You can nest regular expressions as well. Note that column names (the top-level dictionary keys in a nested dictionary) cannot be regular expressions.

    • None:

      • This means that the regex argument must be a string, compiled regular expression, or list, dict, ndarray or Series of such elements. If value is also None then this must be a nested dictionary or Series.

    See the examples section for examples of each of these.

  • value (scalar, dict, list, str, regex, default None) – Value to replace any values matching to_replace with. For a DataFrame a dict of values can be used to specify which value to use for each column (columns not in the dict will not be filled). Regular expressions, strings and lists or dicts of such objects are also allowed.

  • inplace (bool, default False) – If True, performs operation inplace and returns None.

  • limit (int, default None) – Maximum size gap to forward or backward fill.

  • regex (bool or same types as to_replace, default False) – Whether to interpret to_replace and/or value as regular expressions. If this is True then to_replace must be a string. Alternatively, this could be a regular expression or a list, dict, or array of regular expressions in which case to_replace must be None.

  • method ({'pad', 'ffill', 'bfill'}) – The method to use for replacement when to_replace is a scalar, list or tuple and value is None.

Returns:

Object after replacement.

Return type:

Series

Raises:
  • AssertionError

    • If regex is not a bool and to_replace is not None.

  • TypeError

    • If to_replace is not a scalar, array-like, dict, or None.

    • If to_replace is a dict and value is not a list, dict, ndarray, or Series.

    • If to_replace is None and regex is not compilable into a regular expression or is a list, dict, ndarray, or Series.

    • When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced.

  • ValueError

    • If a list or an ndarray is passed to to_replace and value but they are not the same length.

See also

Series.fillna

Fill NA values.

Series.where

Replace values based on boolean condition.

Series.str.replace

Simple string replacement.

Notes

  • Regex substitution is performed under the hood with re.sub. The rules for substitution for re.sub are the same.

  • Regular expressions will only substitute on strings, meaning you cannot provide, for example, a regular expression matching floating point numbers and expect the columns in your frame that have a numeric dtype to be matched. However, if those floating point numbers are strings, then you can do this.

  • This method has a lot of options. You are encouraged to experiment and play with this method to gain intuition about how it works.

  • When a dict is used as the to_replace value, the keys of the dict act as the to_replace part and the values of the dict act as the value parameter.

Examples

Scalar `to_replace` and `value`

>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.replace(1, 5)
0    5
1    2
2    3
3    4
4    5
dtype: int64
>>> df = pd.DataFrame({'A': [0, 1, 2, 3, 4],
...                    'B': [5, 6, 7, 8, 9],
...                    'C': ['a', 'b', 'c', 'd', 'e']})
>>> df.replace(0, 5)
    A  B  C
0  5  5  a
1  1  6  b
2  2  7  c
3  3  8  d
4  4  9  e

List-like `to_replace`

>>> df.replace([0, 1, 2, 3], 4)
    A  B  C
0  4  5  a
1  4  6  b
2  4  7  c
3  4  8  d
4  4  9  e
>>> df.replace([0, 1, 2, 3], [4, 3, 2, 1])
    A  B  C
0  4  5  a
1  3  6  b
2  2  7  c
3  1  8  d
4  4  9  e
>>> s.replace([1, 2], method='bfill')
0    3
1    3
2    3
3    4
4    5
dtype: int64

dict-like `to_replace`

>>> df.replace({0: 10, 1: 100})
        A  B  C
0   10  5  a
1  100  6  b
2    2  7  c
3    3  8  d
4    4  9  e
>>> df.replace({'A': 0, 'B': 5}, 100)
        A    B  C
0  100  100  a
1    1    6  b
2    2    7  c
3    3    8  d
4    4    9  e
>>> df.replace({'A': {0: 100, 4: 400}})
        A  B  C
0  100  5  a
1    1  6  b
2    2  7  c
3    3  8  d
4  400  9  e

Regular expression `to_replace`

>>> df = pd.DataFrame({'A': ['bat', 'foo', 'bait'],
...                    'B': ['abc', 'bar', 'xyz']})
>>> df.replace(to_replace=r'^ba.$', value='new', regex=True)
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace({'A': r'^ba.$'}, {'A': 'new'}, regex=True)
        A    B
0   new  abc
1   foo  bar
2  bait  xyz
>>> df.replace(regex=r'^ba.$', value='new')
        A    B
0   new  abc
1   foo  new
2  bait  xyz
>>> df.replace(regex={r'^ba.$': 'new', 'foo': 'xyz'})
        A    B
0   new  abc
1   xyz  new
2  bait  xyz
>>> df.replace(regex=[r'^ba.$', 'foo'], value='new')
        A    B
0   new  abc
1   new  new
2  bait  xyz

Compare the behavior of s.replace({'a': None}) and s.replace('a', None) to understand the peculiarities of the to_replace parameter:

>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])

When one uses a dict as the to_replace value, the values of the dict act as the value parameter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None}, value=None, method=None):

>>> s.replace({'a': None})
0      10
1    None
2    None
3       b
4    None
dtype: object

When value is not explicitly passed and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 (padded forward from row 0) in rows 1 and 2, and by ‘b’ in row 4.

>>> s.replace('a')
0    10
1    10
2    10
3     b
4     b
dtype: object

On the other hand, if None is explicitly passed for value, it will be respected:

>>> s.replace('a', None)
0      10
1    None
2    None
3       b
4    None
dtype: object

Changed in version 1.4.0: Previously the explicit None was silently ignored.

info(verbose=None, buf=None, max_cols=None, memory_usage=None, show_counts=True)[source]

Print a concise summary of a Series.

This method prints information about a Series including the index dtype, non-null values and memory usage.

New in version 1.4.0.

Parameters:
  • verbose (bool, optional) – Whether to print the full summary. By default, the setting in pandas.options.display.max_info_columns is followed.

  • buf (writable buffer, defaults to sys.stdout) – Where to send the output. By default, the output is printed to sys.stdout. Pass a writable buffer if you need to further process the output.

  • memory_usage (bool, str, optional) –

    Specifies whether total memory usage of the Series elements (including the index) should be displayed. By default, this follows the pandas.options.display.memory_usage setting.

    True always shows memory usage. False never shows memory usage. A value of ‘deep’ is equivalent to “True with deep introspection”. Memory usage is shown in human-readable units (base-2 representation). Without deep introspection, a memory estimation is made based on column dtype and number of rows, assuming values consume the same memory amount for corresponding dtypes. With deep memory introspection, a real memory usage calculation is performed at the cost of computational resources. See the Frequently Asked Questions for more details.

  • show_counts (bool, optional) – Whether to show the non-null counts. By default, this is shown only if the DataFrame is smaller than pandas.options.display.max_info_rows and pandas.options.display.max_info_columns. A value of True always shows the counts, and False never shows the counts.

  • max_cols (int, optional) – Unused for Series; accepted for compatibility with DataFrame.info.

Returns:

This method prints a summary of a Series and returns None.

Return type:

None

See also

Series.describe

Generate descriptive statistics of Series.

Series.memory_usage

Memory usage of Series.

Examples

>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> s = pd.Series(text_values, index=int_values)
>>> s.info()
<class 'pandas.core.series.Series'>
Index: 5 entries, 1 to 5
Series name: None
Non-Null Count  Dtype
--------------  -----
5 non-null      object
dtypes: object(1)
memory usage: 80.0+ bytes

Prints a summary excluding information about its values:

>>> s.info(verbose=False)
<class 'pandas.core.series.Series'>
Index: 5 entries, 1 to 5
dtypes: object(1)
memory usage: 80.0+ bytes

Pipe the output of Series.info to a buffer instead of sys.stdout, get the buffer content, and write it to a text file:

>>> import io
>>> buffer = io.StringIO()
>>> s.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
...           encoding="utf-8") as f:  
...     f.write(s)
260

The memory_usage parameter allows deep introspection mode, especially useful for big Series and for fine-tuning memory optimization:

>>> s = pd.Series(np.random.choice(['a', 'b', 'c'], 10 ** 6))
>>> s.info()
<class 'pandas.core.series.Series'>
RangeIndex: 1000000 entries, 0 to 999999
Series name: None
Non-Null Count    Dtype
--------------    -----
1000000 non-null  object
dtypes: object(1)
memory usage: 7.6+ MB
>>> s.info(memory_usage='deep')
<class 'pandas.core.series.Series'>
RangeIndex: 1000000 entries, 0 to 999999
Series name: None
Non-Null Count    Dtype
--------------    -----
1000000 non-null  object
dtypes: object(1)
memory usage: 55.3 MB
shift(periods=1, freq=None, axis=0, fill_value=None)[source]

Shift index by desired number of periods with an optional time freq.

When freq is not passed, shift the index without realigning the data. If freq is passed (in this case, the index must be date or datetime, or it will raise a NotImplementedError), the index will be increased using the periods and the freq. freq can be inferred when specified as “infer” as long as either freq or inferred_freq attribute is set in the index.

Parameters:
  • periods (int) – Number of periods to shift. Can be positive or negative.

  • freq (DateOffset, tseries.offsets, timedelta, or str, optional) – Offset to use from the tseries module or time rule (e.g. ‘EOM’). If freq is specified then the index values are shifted but the data is not realigned. That is, use freq if you would like to extend the index when shifting and preserve the original data. If freq is specified as “infer” then it will be inferred from the freq or inferred_freq attributes of the index. If neither of those attributes exist, a ValueError is thrown.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Shift direction. For Series this parameter is unused and defaults to 0.

  • fill_value (object, optional) –

    The scalar value to use for newly introduced missing values. The default depends on the dtype of self. For numeric data, np.nan is used. For datetime, timedelta, or period data, NaT is used. For extension dtypes, self.dtype.na_value is used.

    Changed in version 1.1.0.

Returns:

Copy of input object, shifted.

Return type:

Series

See also

Index.shift

Shift values of Index.

DatetimeIndex.shift

Shift values of DatetimeIndex.

PeriodIndex.shift

Shift values of PeriodIndex.

Examples

>>> df = pd.DataFrame({"Col1": [10, 20, 15, 30, 45],
...                    "Col2": [13, 23, 18, 33, 48],
...                    "Col3": [17, 27, 22, 37, 52]},
...                   index=pd.date_range("2020-01-01", "2020-01-05"))
>>> df
            Col1  Col2  Col3
2020-01-01    10    13    17
2020-01-02    20    23    27
2020-01-03    15    18    22
2020-01-04    30    33    37
2020-01-05    45    48    52
>>> df.shift(periods=3)
            Col1  Col2  Col3
2020-01-01   NaN   NaN   NaN
2020-01-02   NaN   NaN   NaN
2020-01-03   NaN   NaN   NaN
2020-01-04  10.0  13.0  17.0
2020-01-05  20.0  23.0  27.0
>>> df.shift(periods=1, axis="columns")
            Col1  Col2  Col3
2020-01-01   NaN    10    13
2020-01-02   NaN    20    23
2020-01-03   NaN    15    18
2020-01-04   NaN    30    33
2020-01-05   NaN    45    48
>>> df.shift(periods=3, fill_value=0)
            Col1  Col2  Col3
2020-01-01     0     0     0
2020-01-02     0     0     0
2020-01-03     0     0     0
2020-01-04    10    13    17
2020-01-05    20    23    27
>>> df.shift(periods=3, freq="D")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
>>> df.shift(periods=3, freq="infer")
            Col1  Col2  Col3
2020-01-04    10    13    17
2020-01-05    20    23    27
2020-01-06    15    18    22
2020-01-07    30    33    37
2020-01-08    45    48    52
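
Negative periods shift the other way: values move toward earlier index labels and the trailing positions become NaN. A minimal sketch reusing df:

>>> df.shift(periods=-2)
            Col1  Col2  Col3
2020-01-01  15.0  18.0  22.0
2020-01-02  30.0  33.0  37.0
2020-01-03  45.0  48.0  52.0
2020-01-04   NaN   NaN   NaN
2020-01-05   NaN   NaN   NaN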
add(other, level=None, fill_value=None, axis=0)

Return Addition of series and other, element-wise (binary operator add).

Equivalent to series + other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.radd

Reverse of the Addition operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
all(axis=0, bool_only=None, skipna=True, **kwargs)

Return whether all elements are True, potentially over an axis.

Returns True unless there is at least one element within a Series or along a DataFrame axis that is False or equivalent (e.g. zero or empty).

Parameters:
  • axis ({0 or 'index', 1 or 'columns', None}, default 0) –

    Indicate which axis or axes should be reduced. For Series this parameter is unused and defaults to 0.

    • 0 / ‘index’ : reduce the index, return a Series whose index is the original column labels.

    • 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.

    • None : reduce all axes, return a scalar.

  • bool_only (bool, default None) – Include only boolean columns. If None, will attempt to use everything, then use only boolean data. Not implemented for Series.

  • skipna (bool, default True) – Exclude NA/null values. If the entire row/column is NA and skipna is True, then the result will be True, as for an empty row/column. If skipna is False, then NA are treated as True, because these are not equal to zero.

  • **kwargs (any, default None) – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

If level is specified, a Series is returned; otherwise, a scalar is returned.

Return type:

scalar or Series

See also

Series.all

Return True if all elements are True.

DataFrame.any

Return True if one (or more) elements are True.

Examples

Series

>>> pd.Series([True, True]).all()
True
>>> pd.Series([True, False]).all()
False
>>> pd.Series([], dtype="float64").all()
True
>>> pd.Series([np.nan]).all()
True
>>> pd.Series([np.nan]).all(skipna=False)
True

DataFrames

Create a dataframe from a dictionary.

>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False

Default behaviour checks if values in each column all return True.

>>> df.all()
col1     True
col2    False
dtype: bool

Specify axis='columns' to check if values in each row all return True.

>>> df.all(axis='columns')
0     True
1    False
dtype: bool

Or axis=None for whether every value is True.

>>> df.all(axis=None)
False
cummax(axis=None, skipna=True, *args, **kwargs)

Return cumulative maximum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative maximum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative maximum of scalar or Series.

Return type:

scalar or Series

See also

core.window.expanding.Expanding.max

Similar functionality but ignores NaN values.

Series.max

Return the maximum over Series axis.

Series.cummax

Return cumulative maximum over Series axis.

Series.cummin

Return cumulative minimum over Series axis.

Series.cumsum

Return cumulative sum over Series axis.

Series.cumprod

Return cumulative product over Series axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummax()
0    2.0
1    NaN
2    5.0
3    5.0
4    5.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummax(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummax()
     A    B
0  2.0  1.0
1  3.0  NaN
2  3.0  1.0

To iterate over columns and find the maximum in each row, use axis=1

>>> df.cummax(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  1.0
cummin(axis=None, skipna=True, *args, **kwargs)

Return cumulative minimum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative minimum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative minimum of scalar or Series.

Return type:

scalar or Series

See also

core.window.expanding.Expanding.min

Similar functionality but ignores NaN values.

Series.min

Return the minimum over Series axis.

Series.cummax

Return cumulative maximum over Series axis.

Series.cummin

Return cumulative minimum over Series axis.

Series.cumsum

Return cumulative sum over Series axis.

Series.cumprod

Return cumulative product over Series axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cummin()
0    2.0
1    NaN
2    2.0
3   -1.0
4   -1.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cummin(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cummin()
     A    B
0  2.0  1.0
1  2.0  NaN
2  1.0  0.0

To iterate over columns and find the minimum in each row, use axis=1

>>> df.cummin(axis=1)
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0
cumprod(axis=None, skipna=True, *args, **kwargs)

Return cumulative product over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative product.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative product of scalar or Series.

Return type:

scalar or Series

See also

core.window.expanding.Expanding.prod

Similar functionality but ignores NaN values.

Series.prod

Return the product over Series axis.

Series.cummax

Return cumulative maximum over Series axis.

Series.cummin

Return cumulative minimum over Series axis.

Series.cumsum

Return cumulative sum over Series axis.

Series.cumprod

Return cumulative product over Series axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumprod()
0     2.0
1     NaN
2    10.0
3   -10.0
4    -0.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumprod(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumprod()
     A    B
0  2.0  1.0
1  6.0  NaN
2  6.0  0.0

To iterate over columns and find the product in each row, use axis=1

>>> df.cumprod(axis=1)
     A    B
0  2.0  2.0
1  3.0  NaN
2  1.0  0.0
cumsum(axis=None, skipna=True, *args, **kwargs)

Return cumulative sum over a DataFrame or Series axis.

Returns a DataFrame or Series of the same size containing the cumulative sum.

Parameters:
  • axis ({0 or 'index', 1 or 'columns'}, default 0) – The index or the name of the axis. 0 is equivalent to None or ‘index’. For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • *args – Additional keywords have no effect but might be accepted for compatibility with NumPy.

  • **kwargs – Additional keywords have no effect but might be accepted for compatibility with NumPy.

Returns:

Return cumulative sum of scalar or Series.

Return type:

scalar or Series

See also

core.window.expanding.Expanding.sum

Similar functionality but ignores NaN values.

Series.sum

Return the sum over Series axis.

Series.cummax

Return cumulative maximum over Series axis.

Series.cummin

Return cumulative minimum over Series axis.

Series.cumsum

Return cumulative sum over Series axis.

Series.cumprod

Return cumulative product over Series axis.

Examples

Series

>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0    2.0
1    NaN
2    5.0
3   -1.0
4    0.0
dtype: float64

By default, NA values are ignored.

>>> s.cumsum()
0    2.0
1    NaN
2    7.0
3    6.0
4    6.0
dtype: float64

To include NA values in the operation, use skipna=False

>>> s.cumsum(skipna=False)
0    2.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64

DataFrame

>>> df = pd.DataFrame([[2.0, 1.0],
...                    [3.0, np.nan],
...                    [1.0, 0.0]],
...                   columns=list('AB'))
>>> df
     A    B
0  2.0  1.0
1  3.0  NaN
2  1.0  0.0

By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or axis='index'.

>>> df.cumsum()
     A    B
0  2.0  1.0
1  5.0  NaN
2  6.0  1.0

To iterate over columns and find the sum in each row, use axis=1

>>> df.cumsum(axis=1)
     A    B
0  2.0  3.0
1  3.0  NaN
2  1.0  1.0
divide(other, level=None, fill_value=None, axis=0)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
divmod(other, level=None, fill_value=None, axis=0)

Return Integer division and modulo of series and other, element-wise (binary operator divmod).

Equivalent to divmod(series, other), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

2-Tuple of Series

See also

Series.rdivmod

Reverse of the Integer division and modulo operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    inf
 c    inf
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
eq(other, level=None, fill_value=None, axis=0)

Return Equal to of series and other, element-wise (binary operator eq).

Equivalent to series == other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool
floordiv(other, level=None, fill_value=None, axis=0)

Return Integer division of series and other, element-wise (binary operator floordiv).

Equivalent to series // other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rfloordiv

Reverse of the Integer division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
ge(other, level=None, fill_value=None, axis=0)

Return Greater than or equal to of series and other, element-wise (binary operator ge).

Equivalent to series >= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.ge(b, fill_value=0)
a     True
b     True
c    False
d    False
e     True
f    False
dtype: bool
gt(other, level=None, fill_value=None, axis=0)

Return Greater than of series and other, element-wise (binary operator gt).

Equivalent to series > other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.gt(b, fill_value=0)
a     True
b    False
c    False
d    False
e     True
f    False
dtype: bool
kurt(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar
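
Examples

A minimal sketch (values chosen so the unbiased kurtosis is exact):

>>> s = pd.Series([1, 2, 2, 3])
>>> s.kurt()
1.5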

kurtosis(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased kurtosis over requested axis.

Kurtosis obtained using Fisher’s definition of kurtosis (kurtosis of normal == 0.0). Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar
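
Examples

kurtosis is an alias for kurt, so the same minimal sketch applies:

>>> pd.Series([1, 2, 2, 3]).kurtosis()
1.5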

le(other, level=None, fill_value=None, axis=0)

Return Less than or equal to of series and other, element-wise (binary operator le).

Equivalent to series <= other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.le(b, fill_value=0)
a    False
b     True
c     True
d    False
e    False
f     True
dtype: bool
lt(other, level=None, fill_value=None, axis=0)

Return Less than of series and other, element-wise (binary operator lt).

Equivalent to series < other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan, 1], index=['a', 'b', 'c', 'd', 'e'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
e    1.0
dtype: float64
>>> b = pd.Series([0, 1, 2, np.nan, 1], index=['a', 'b', 'c', 'd', 'f'])
>>> b
a    0.0
b    1.0
c    2.0
d    NaN
f    1.0
dtype: float64
>>> a.lt(b, fill_value=0)
a    False
b    False
c     True
d    False
e    False
f     True
dtype: bool
max(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the maximum of the values over the requested axis.

If you want the index of the maximum, use idxmax; that is the equivalent of the numpy.ndarray method argmax.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.max()
8
mean(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the mean of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar
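
Examples

A minimal sketch:

>>> s = pd.Series([1, 2, 3])
>>> s.mean()
2.0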

median(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the median of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar
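
Examples

A minimal sketch; with an even number of values, the two middle values are averaged:

>>> s = pd.Series([1, 2, 3, 4])
>>> s.median()
2.5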

memory_usage(index=True, deep=False)[source]

Return the memory usage of the Series.

The memory usage can optionally include the contribution of the index and of elements of object dtype.

Parameters:
  • index (bool, default True) – Specifies whether to include the memory usage of the Series index.

  • deep (bool, default False) – If True, introspect the data deeply by interrogating object dtypes for system-level memory consumption, and include it in the returned value.

Returns:

Bytes of memory consumed.

Return type:

int

See also

numpy.ndarray.nbytes

Total bytes consumed by the elements of the array.

DataFrame.memory_usage

Bytes consumed by a DataFrame.

Examples

>>> s = pd.Series(range(3))
>>> s.memory_usage()
152

Not including the index gives the size of the rest of the data, which is necessarily smaller:

>>> s.memory_usage(index=False)
24

The memory footprint of object values is ignored by default:

>>> s = pd.Series(["a", "b"])
>>> s.values
array(['a', 'b'], dtype=object)
>>> s.memory_usage()
144
>>> s.memory_usage(deep=True)
244
min(axis=0, skipna=True, numeric_only=False, **kwargs)

Return the minimum of the values over the requested axis.

If you want the index of the minimum, use idxmin; that is the equivalent of the numpy.ndarray method argmin.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.min()
0
mod(other, level=None, fill_value=None, axis=0)

Return Modulo of series and other, element-wise (binary operator mod).

Equivalent to series % other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rmod

Reverse of the Modulo operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
mul(other, level=None, fill_value=None, axis=0)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rmul

Reverse of the Multiplication operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
multiply(other, level=None, fill_value=None, axis=0)

Return Multiplication of series and other, element-wise (binary operator mul).

Equivalent to series * other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rmul

Reverse of the Multiplication operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
ne(other, level=None, fill_value=None, axis=0)

Return Not equal to of series and other, element-wise (binary operator ne).

Equivalent to series != other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.ne(b, fill_value=0)
a    False
b     True
c     True
d     True
e     True
dtype: bool
pow(other, level=None, fill_value=None, axis=0)

Return Exponential power of series and other, element-wise (binary operator pow).

Equivalent to series ** other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rpow

Reverse of the Exponential power operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
prod(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
product(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the product of the values over the requested axis.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

By default, the product of an empty or all-NA Series is 1.

>>> pd.Series([], dtype="float64").prod()
1.0

This can be controlled with the min_count parameter.

>>> pd.Series([], dtype="float64").prod(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
radd(other, level=None, fill_value=None, axis=0)

Return Addition of series and other, element-wise (binary operator radd).

Equivalent to other + series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.add

Element-wise Addition, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64
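
The example above uses the forward method. Calling radd directly on the same data (an added sketch; since addition commutes, the result matches a.add(b, fill_value=0)):

>>> a.radd(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64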
rdivmod(other, level=None, fill_value=None, axis=0)

Return Integer division and modulo of series and other, element-wise (binary operator rdivmod).

Equivalent to divmod(other, series), but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

2-Tuple of Series

See also

Series.divmod

Element-wise Integer division and modulo, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divmod(b, fill_value=0)
(a    1.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64,
 a    0.0
 b    NaN
 c    NaN
 d    0.0
 e    NaN
 dtype: float64)
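
The example above demonstrates the forward divmod. Calling rdivmod on the same data computes divmod(b, a) instead (an added sketch; as in the examples above, floor division by zero gives inf and modulo by zero gives NaN):

>>> a.rdivmod(b, fill_value=0)
(a    1.0
 b    0.0
 c    0.0
 d    inf
 e    NaN
 dtype: float64,
 a    0.0
 b    0.0
 c    0.0
 d    NaN
 e    NaN
 dtype: float64)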
rfloordiv(other, level=None, fill_value=None, axis=0)

Return Integer division of series and other, element-wise (binary operator rfloordiv).

Equivalent to other // series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.floordiv

Element-wise Integer division, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.floordiv(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
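
The example above uses the forward method. Calling rfloordiv on the same data computes b // a instead (an added sketch; as above, floor division by zero yields inf):

>>> a.rfloordiv(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    inf
e    NaN
dtype: float64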
rmod(other, level=None, fill_value=None, axis=0)

Return Modulo of series and other, element-wise (binary operator rmod).

Equivalent to other % series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.mod

Element-wise Modulo, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.mod(b, fill_value=0)
a    0.0
b    NaN
c    NaN
d    0.0
e    NaN
dtype: float64
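
The example above uses the forward method. Calling rmod on the same data computes b % a instead (an added sketch; as above, modulo by zero yields NaN):

>>> a.rmod(b, fill_value=0)
a    0.0
b    0.0
c    0.0
d    NaN
e    NaN
dtype: float64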
rmul(other, level=None, fill_value=None, axis=0)

Return Multiplication of series and other, element-wise (binary operator rmul).

Equivalent to other * series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.mul

Element-wise Multiplication, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.multiply(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64
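
The example above uses the forward method. Since multiplication commutes, calling rmul directly on the same data gives the same result (an added sketch):

>>> a.rmul(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    0.0
e    NaN
dtype: float64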
rpow(other, level=None, fill_value=None, axis=0)

Return Exponential power of series and other, element-wise (binary operator rpow).

Equivalent to other ** series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.pow

Element-wise Exponential power, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.pow(b, fill_value=0)
a    1.0
b    1.0
c    1.0
d    0.0
e    NaN
dtype: float64
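
The example above uses the forward method. Calling rpow on the same data computes b ** a instead (an added sketch; note that 0 ** 1 is 0 while 1 ** 0 is 1):

>>> a.rpow(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    1.0
e    NaN
dtype: float64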
rsub(other, level=None, fill_value=None, axis=0)

Return Subtraction of series and other, element-wise (binary operator rsub).

Equivalent to other - series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.sub

Element-wise Subtraction, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
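
The example above uses the forward method. Calling rsub on the same data computes b - a instead (an added sketch):

>>> a.rsub(b, fill_value=0)
a    0.0
b   -1.0
c   -1.0
d    1.0
e    NaN
dtype: float64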
rtruediv(other, level=None, fill_value=None, axis=0)

Return Floating division of series and other, element-wise (binary operator rtruediv).

Equivalent to other / series, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.truediv

Element-wise Floating division, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
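
The example above uses the forward method. Calling rtruediv on the same data computes b / a instead (an added sketch; division by zero yields inf):

>>> a.rtruediv(b, fill_value=0)
a    1.0
b    0.0
c    0.0
d    inf
e    NaN
dtype: float64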
sem(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return unbiased standard error of the mean over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

scalar
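
Examples

A minimal illustration (an added sketch, assuming a small numeric Series; the standard error equals the sample standard deviation divided by the square root of the number of observations):

>>> s = pd.Series([1, 2, 3])
>>> s.sem()
0.5773502691896258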

skew(axis=0, skipna=True, numeric_only=False, **kwargs)

Return unbiased skew over requested axis.

Normalized by N-1.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar
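
Examples

A minimal illustration (an added sketch, assuming a small Series; symmetric data has zero skew):

>>> s = pd.Series([1, 2, 3])
>>> s.skew()
0.0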

std(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return sample standard deviation over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

scalar

Notes

To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1).

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                    'age': [21, 25, 62, 43],
...                    'height': [1.61, 1.87, 1.49, 2.01]}
...                   ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01

The standard deviation of the columns can be found as follows:

>>> df.std()
age       18.786076
height     0.237417
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.std(ddof=0)
age       16.269219
height     0.205609
dtype: float64
sub(other, level=None, fill_value=None, axis=0)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rsub

Reverse of the Subtraction operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
subtract(other, level=None, fill_value=None, axis=0)

Return Subtraction of series and other, element-wise (binary operator sub).

Equivalent to series - other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rsub

Reverse of the Subtraction operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.subtract(b, fill_value=0)
a    0.0
b    1.0
c    1.0
d   -1.0
e    NaN
dtype: float64
sum(axis=None, skipna=True, numeric_only=False, min_count=0, **kwargs)

Return the sum of the values over the requested axis.

This is equivalent to the method numpy.sum.

Parameters:
  • axis ({index (0)}) –

    Axis for the function to be applied on. For Series this parameter is unused and defaults to 0.

    For DataFrames, specifying axis=None will apply the aggregation across both axes.

    New in version 2.0.0.

  • skipna (bool, default True) – Exclude NA/null values when computing the result.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

  • min_count (int, default 0) – The required number of valid values to perform the operation. If fewer than min_count non-NA values are present the result will be NA.

  • **kwargs – Additional keyword arguments to be passed to the function.

Return type:

scalar

See also

Series.sum

Return the sum.

Series.min

Return the minimum.

Series.max

Return the maximum.

Series.idxmin

Return the index of the minimum.

Series.idxmax

Return the index of the maximum.

DataFrame.sum

Return the sum over the requested axis.

DataFrame.min

Return the minimum over the requested axis.

DataFrame.max

Return the maximum over the requested axis.

DataFrame.idxmin

Return the index of the minimum over the requested axis.

DataFrame.idxmax

Return the index of the maximum over the requested axis.

Examples

>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)
>>> s
blooded  animal
warm     dog       4
         falcon    2
cold     fish      0
         spider    8
Name: legs, dtype: int64
>>> s.sum()
14

By default, the sum of an empty or all-NA Series is 0.

>>> pd.Series([], dtype="float64").sum()  # min_count=0 is the default
0.0

This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.

>>> pd.Series([], dtype="float64").sum(min_count=1)
nan

Thanks to the skipna parameter, min_count handles all-NA and empty series identically.

>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
truediv(other, level=None, fill_value=None, axis=0)

Return Floating division of series and other, element-wise (binary operator truediv).

Equivalent to series / other, but with support to substitute a fill_value for missing data in either one of the inputs.

Parameters:
  • other (Series or scalar value) –

  • level (int or name) – Broadcast across a level, matching Index values on the passed MultiIndex level.

  • fill_value (None or float value, default None (NaN)) – Fill existing missing (NaN) values, and any new element needed for successful Series alignment, with this value before computation. If data in both corresponding Series locations is missing the result of filling (at that location) will be missing.

  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

Returns:

The result of the operation.

Return type:

Series

See also

Series.rtruediv

Reverse of the Floating division operator, see Python documentation for more details.

Examples

>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> a
a    1.0
b    1.0
c    1.0
d    NaN
dtype: float64
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> b
a    1.0
b    NaN
d    1.0
e    NaN
dtype: float64
>>> a.divide(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64
var(axis=None, skipna=True, ddof=1, numeric_only=False, **kwargs)

Return unbiased variance over requested axis.

Normalized by N-1 by default. This can be changed using the ddof argument.

Parameters:
  • axis ({index (0)}) – For Series this parameter is unused and defaults to 0.

  • skipna (bool, default True) – Exclude NA/null values. If an entire row/column is NA, the result will be NA.

  • ddof (int, default 1) – Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N represents the number of elements.

  • numeric_only (bool, default False) – Include only float, int, boolean columns. Not implemented for Series.

Return type:

scalar

Examples

>>> df = pd.DataFrame({'person_id': [0, 1, 2, 3],
...                   'age': [21, 25, 62, 43],
...                   'height': [1.61, 1.87, 1.49, 2.01]}
...                  ).set_index('person_id')
>>> df
           age  height
person_id
0           21    1.61
1           25    1.87
2           62    1.49
3           43    2.01
>>> df.var()
age       352.916667
height      0.056367
dtype: float64

Alternatively, ddof=0 can be set to normalize by N instead of N-1:

>>> df.var(ddof=0)
age       264.687500
height      0.042275
dtype: float64
isin(values)[source]

Whether elements in Series are contained in values.

Return a boolean Series showing whether each element in the Series matches an element in the passed sequence of values exactly.

Parameters:

values (set or list-like) – The sequence of values to test. Passing in a single string will raise a TypeError. Instead, turn a single string into a list of one element.

Returns:

Series of booleans indicating if each element is in values.

Return type:

Series

Raises:

TypeError

  • If values is a string

See also

DataFrame.isin

Equivalent method on DataFrame.

Examples

>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama',
...                'hippo'], name='animal')
>>> s.isin(['cow', 'lama'])
0     True
1     True
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

To invert the boolean values, use the ~ operator:

>>> ~s.isin(['cow', 'lama'])
0    False
1    False
2    False
3     True
4    False
5     True
Name: animal, dtype: bool

Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:

>>> s.isin(['lama'])
0     True
1    False
2     True
3    False
4     True
5    False
Name: animal, dtype: bool

Strings and integers are distinct and are therefore not comparable:

>>> pd.Series([1]).isin(['1'])
0    False
dtype: bool
>>> pd.Series([1.1]).isin(['1.1'])
0    False
dtype: bool
between(left, right, inclusive='both')[source]

Return boolean Series equivalent to left <= series <= right.

This function returns a boolean vector containing True wherever the corresponding Series element is between the boundary values left and right. NA values are treated as False.

Parameters:
  • left (scalar or list-like) – Left boundary.

  • right (scalar or list-like) – Right boundary.

  • inclusive ({"both", "neither", "left", "right"}) –

    Include boundaries. Whether to set each bound as closed or open.

    Changed in version 1.3.0.

Returns:

Series representing whether each element is between left and right (inclusive).

Return type:

Series

See also

Series.gt

Greater than of series and other.

Series.lt

Less than of series and other.

Notes

This function is equivalent to (left <= ser) & (ser <= right).

Examples

>>> s = pd.Series([2, 0, 4, 8, np.nan])

Boundary values are included by default:

>>> s.between(1, 4)
0     True
1    False
2     True
3    False
4    False
dtype: bool

With inclusive set to "neither" boundary values are excluded:

>>> s.between(1, 4, inclusive="neither")
0     True
1    False
2    False
3    False
4    False
dtype: bool

left and right can be any scalar value:

>>> s = pd.Series(['Alice', 'Bob', 'Carol', 'Eve'])
>>> s.between('Anna', 'Daniel')
0    False
1     True
2     True
3    False
dtype: bool
isna()[source]

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

Series

See also

Series.isnull

Alias of isna.

Series.notna

Boolean inverse of isna.

Series.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
isnull()[source]

Series.isnull is an alias for Series.isna.

Detect missing values.

Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True. Everything else gets mapped to False. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True).

Returns:

Mask of bool values for each element in Series that indicates whether an element is an NA value.

Return type:

Series

See also

Series.isnull

Alias of isna.

Series.notna

Boolean inverse of isna.

Series.dropna

Omit axes labels with missing values.

isna

Top-level isna.

Examples

Show which entries in a DataFrame are NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.isna()
     age   born   name    toy
0  False   True  False   True
1  False  False  False  False
2   True  False  False  False

Show which entries in a Series are NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.isna()
0    False
1    False
2     True
dtype: bool
notna()[source]

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

Series

See also

Series.notnull

Alias of notna.

Series.isna

Boolean inverse of notna.

Series.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
notnull()[source]

Series.notnull is an alias for Series.notna.

Detect existing (non-missing) values.

Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Values such as empty strings '' or numpy.inf are not considered NA (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False.

Returns:

Mask of bool values for each element in Series that indicates whether an element is not an NA value.

Return type:

Series

See also

Series.notnull

Alias of notna.

Series.isna

Boolean inverse of notna.

Series.dropna

Omit axes labels with missing values.

notna

Top-level notna.

Examples

Show which entries in a DataFrame are not NA.

>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> df
   age       born    name        toy
0  5.0        NaT  Alfred       None
1  6.0 1939-05-27  Batman  Batmobile
2  NaN 1940-04-25              Joker
>>> df.notna()
     age   born  name    toy
0   True  False  True  False
1   True   True  True   True
2  False   True  True   True

Show which entries in a Series are not NA.

>>> ser = pd.Series([5, 6, np.NaN])
>>> ser
0    5.0
1    6.0
2    NaN
dtype: float64
>>> ser.notna()
0     True
1     True
2    False
dtype: bool
dropna(*, axis: int | Literal['index', 'columns', 'rows'] = 0, inplace: Literal[False] = False, how: Literal['any', 'all'] | None = None, ignore_index: bool = False) Series[source]
dropna(*, axis: int | Literal['index', 'columns', 'rows'] = 0, inplace: Literal[True], how: Literal['any', 'all'] | None = None, ignore_index: bool = False) None

Return a new Series with missing values removed.

See the User Guide for more on which values are considered missing, and how to work with missing data.

Parameters:
  • axis ({0 or 'index'}) – Unused. Parameter needed for compatibility with DataFrame.

  • inplace (bool, default False) – If True, do operation inplace and return None.

  • how (str, optional) – Not in use. Kept for compatibility.

  • ignore_index (bool, default False) –

    If True, the resulting axis will be labeled 0, 1, …, n - 1.

    New in version 2.0.0.

Returns:

Series with NA entries dropped from it or None if inplace=True.

Return type:

Series or None

See also

Series.isna

Indicate missing values.

Series.notna

Indicate existing (non-missing) values.

Series.fillna

Replace missing values.

DataFrame.dropna

Drop rows or columns which contain NA values.

Index.dropna

Drop missing indices.

Examples

>>> ser = pd.Series([1., 2., np.nan])
>>> ser
0    1.0
1    2.0
2    NaN
dtype: float64

Drop NA values from a Series.

>>> ser.dropna()
0    1.0
1    2.0
dtype: float64

Empty strings are not considered NA values. None is considered an NA value.

>>> ser = pd.Series([np.NaN, 2, pd.NaT, '', None, 'I stay'])
>>> ser
0       NaN
1         2
2       NaT
3
4      None
5    I stay
dtype: object
>>> ser.dropna()
1         2
3
5    I stay
dtype: object
asfreq(freq, method=None, how=None, normalize=False, fill_value=None)[source]

Convert time series to specified frequency.

Returns the original data conformed to a new index with the specified frequency.

If the index of this Series is a PeriodIndex, the new index is the result of transforming the original index with PeriodIndex.asfreq (so the original index will map one-to-one to the new index).

Otherwise, the new index will be equivalent to pd.date_range(start, end, freq=freq) where start and end are, respectively, the first and last entries in the original index (see pandas.date_range()). The values corresponding to any timesteps in the new index which were not present in the original index will be null (NaN), unless a method for filling such unknowns is provided (see the method parameter below).

The resample() method is more appropriate if an operation on each group of timesteps (such as an aggregate) is necessary to represent the data at the new frequency.

Parameters:
  • freq (DateOffset or str) – Frequency DateOffset or string.

  • method ({'backfill'/'bfill', 'pad'/'ffill'}, default None) –

    Method to use for filling holes in reindexed Series (note this does not fill NaNs that already were present):

    • ‘pad’ / ‘ffill’: Propagate the last valid observation forward to the next valid one.

    • ‘backfill’ / ‘bfill’: Use the next valid observation to fill.

  • how ({'start', 'end'}, default end) – For PeriodIndex only (see PeriodIndex.asfreq).

  • normalize (bool, default False) – Whether to reset output index to midnight.

  • fill_value (scalar, optional) – Value to use for missing values, applied during upsampling (note this does not fill NaNs that already were present).

Returns:

Series object reindexed to the specified frequency.

Return type:

Series

See also

reindex

Conform DataFrame to new index with optional filling logic.

Notes

To learn more about the frequency strings, please see the offset aliases section of the pandas user guide.

Examples

Start by creating a series with 4 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=4, freq='T')
>>> series = pd.Series([0.0, None, 2.0, 3.0], index=index)
>>> df = pd.DataFrame({'s': series})
>>> df
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:01:00    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:03:00    3.0

Upsample the series into 30 second bins.

>>> df.asfreq(freq='30S')
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    NaN
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    NaN
2000-01-01 00:03:00    3.0

Upsample again, providing a fill value.

>>> df.asfreq(freq='30S', fill_value=9.0)
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    9.0
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    9.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    9.0
2000-01-01 00:03:00    3.0

Upsample again, providing a method.

>>> df.asfreq(freq='30S', method='bfill')
                       s
2000-01-01 00:00:00    0.0
2000-01-01 00:00:30    NaN
2000-01-01 00:01:00    NaN
2000-01-01 00:01:30    2.0
2000-01-01 00:02:00    2.0
2000-01-01 00:02:30    3.0
2000-01-01 00:03:00    3.0
resample(rule, axis=0, closed=None, label=None, convention='start', kind=None, on=None, level=None, origin='start_day', offset=None, group_keys=False)[source]

Resample time-series data.

Convenience method for frequency conversion and resampling of time series. The object must have a datetime-like index (DatetimeIndex, PeriodIndex, or TimedeltaIndex), or the caller must pass the label of a datetime-like series/index to the on/level keyword parameter.

Parameters:
  • rule (DateOffset, Timedelta or str) – The offset string or object representing target conversion.

  • axis ({0 or 'index', 1 or 'columns'}, default 0) – Which axis to use for up- or down-sampling. For Series this parameter is unused and defaults to 0. Must be DatetimeIndex, TimedeltaIndex or PeriodIndex.

  • closed ({'right', 'left'}, default None) – Which side of bin interval is closed. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

  • label ({'right', 'left'}, default None) – Which bin edge label to label bucket with. The default is ‘left’ for all frequency offsets except for ‘M’, ‘A’, ‘Q’, ‘BM’, ‘BA’, ‘BQ’, and ‘W’ which all have a default of ‘right’.

  • convention ({'start', 'end', 's', 'e'}, default 'start') – For PeriodIndex only, controls whether to use the start or end of rule.

  • kind ({'timestamp', 'period'}, optional, default None) – Pass ‘timestamp’ to convert the resulting index to a DateTimeIndex or ‘period’ to convert it to a PeriodIndex. By default the input representation is retained.

  • on (str, optional) – For a DataFrame, column to use instead of index for resampling. Column must be datetime-like.

  • level (str or int, optional) – For a MultiIndex, level (name or number) to use for resampling. level must be datetime-like.

  • origin (Timestamp or str, default 'start_day') –

    The timestamp on which to adjust the grouping. The timezone of origin must match the timezone of the index. If string, must be one of the following:

    • ‘epoch’: origin is 1970-01-01

    • ‘start’: origin is the first value of the timeseries

    • ‘start_day’: origin is the first day at midnight of the timeseries

    New in version 1.1.0.

    • ‘end’: origin is the last value of the timeseries

    • ‘end_day’: origin is the ceiling midnight of the last day

    New in version 1.3.0.

  • offset (Timedelta or str, default is None) –

    An offset timedelta added to the origin.

    New in version 1.1.0.

  • group_keys (bool, default False) –

    Whether to include the group keys in the result index when using .apply() on the resampled object.

    New in version 1.5.0: Not specifying group_keys will retain values-dependent behavior from pandas 1.4 and earlier (see pandas 1.5.0 Release notes for examples).

    Changed in version 2.0.0: group_keys now defaults to False.

Returns:

Resampler object.

Return type:

pandas.core.Resampler

See also

Series.resample

Resample a Series.

DataFrame.resample

Resample a DataFrame.

groupby

Group Series by mapping, function, label, or list of labels.

asfreq

Reindex a Series with the given frequency without grouping.

Notes

See the user guide for more.

To learn more about the offset strings, please see the offset aliases section of the pandas user guide.

Examples

Start by creating a series with 9 one minute timestamps.

>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00    0
2000-01-01 00:01:00    1
2000-01-01 00:02:00    2
2000-01-01 00:03:00    3
2000-01-01 00:04:00    4
2000-01-01 00:05:00    5
2000-01-01 00:06:00    6
2000-01-01 00:07:00    7
2000-01-01 00:08:00    8
Freq: T, dtype: int64

Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.

>>> series.resample('3T').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the left. Note that the value used as a bin's label is not included in that bin: in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the summed value in the resampled bucket labelled 2000-01-01 00:03:00 does not include 3 (if it did, the summed value would be 6, not 3). To include this value, close the right side of the bin interval, as illustrated in the example below this one.

>>> series.resample('3T', label='right').sum()
2000-01-01 00:03:00     3
2000-01-01 00:06:00    12
2000-01-01 00:09:00    21
Freq: 3T, dtype: int64

Downsample the series into 3 minute bins as above, but close the right side of the bin interval.

>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64

Upsample the series into 30 second bins.

>>> series.resample('30S').asfreq()[0:5]   # Select first 5 rows
2000-01-01 00:00:00   0.0
2000-01-01 00:00:30   NaN
2000-01-01 00:01:00   1.0
2000-01-01 00:01:30   NaN
2000-01-01 00:02:00   2.0
Freq: 30S, dtype: float64

Upsample the series into 30 second bins and fill the NaN values using the ffill method.

>>> series.resample('30S').ffill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    0
2000-01-01 00:01:00    1
2000-01-01 00:01:30    1
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Upsample the series into 30 second bins and fill the NaN values using the bfill method.

>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00    0
2000-01-01 00:00:30    1
2000-01-01 00:01:00    1
2000-01-01 00:01:30    2
2000-01-01 00:02:00    2
Freq: 30S, dtype: int64

Pass a custom function via apply

>>> def custom_resampler(arraylike):
...     return np.sum(arraylike) + 5
...
>>> series.resample('3T').apply(custom_resampler)
2000-01-01 00:00:00     8
2000-01-01 00:03:00    17
2000-01-01 00:06:00    26
Freq: 3T, dtype: int64

For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or end of rule.

Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.

>>> s = pd.Series([1, 2], index=pd.period_range('2012-01-01',
...                                             freq='A',
...                                             periods=2))
>>> s
2012    1
2013    2
Freq: A-DEC, dtype: int64
>>> s.resample('Q', convention='start').asfreq()
2012Q1    1.0
2012Q2    NaN
2012Q3    NaN
2012Q4    NaN
2013Q1    2.0
2013Q2    NaN
2013Q3    NaN
2013Q4    NaN
Freq: Q-DEC, dtype: float64

Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.

>>> q = pd.Series([1, 2, 3, 4], index=pd.period_range('2018-01-01',
...                                                   freq='Q',
...                                                   periods=4))
>>> q
2018Q1    1
2018Q2    2
2018Q3    3
2018Q4    4
Freq: Q-DEC, dtype: int64
>>> q.resample('M', convention='end').asfreq()
2018-03    1.0
2018-04    NaN
2018-05    NaN
2018-06    2.0
2018-07    NaN
2018-08    NaN
2018-09    3.0
2018-10    NaN
2018-11    NaN
2018-12    4.0
Freq: M, dtype: float64

For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.

>>> d = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...      'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df = pd.DataFrame(d)
>>> df['week_starting'] = pd.date_range('01/01/2018',
...                                     periods=8,
...                                     freq='W')
>>> df
   price  volume week_starting
0     10      50    2018-01-07
1     11      60    2018-01-14
2      9      40    2018-01-21
3     13     100    2018-01-28
4     14      50    2018-02-04
5     18     100    2018-02-11
6     17      40    2018-02-18
7     19      50    2018-02-25
>>> df.resample('M', on='week_starting').mean()
               price  volume
week_starting
2018-01-31     10.75    62.5
2018-02-28     17.00    60.0

For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling needs to take place.

>>> days = pd.date_range('1/1/2000', periods=4, freq='D')
>>> d2 = {'price': [10, 11, 9, 13, 14, 18, 17, 19],
...       'volume': [50, 60, 40, 100, 50, 100, 40, 50]}
>>> df2 = pd.DataFrame(
...     d2,
...     index=pd.MultiIndex.from_product(
...         [days, ['morning', 'afternoon']]
...     )
... )
>>> df2
                      price  volume
2000-01-01 morning       10      50
           afternoon     11      60
2000-01-02 morning        9      40
           afternoon     13     100
2000-01-03 morning       14      50
           afternoon     18     100
2000-01-04 morning       17      40
           afternoon     19      50
>>> df2.resample('D', level=0).sum()
            price  volume
2000-01-01     21     110
2000-01-02     22     140
2000-01-03     32     150
2000-01-04     36      90

If you want to adjust the start of the bins based on a fixed timestamp:

>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts
2000-10-01 23:30:00     0
2000-10-01 23:37:00     3
2000-10-01 23:44:00     6
2000-10-01 23:51:00     9
2000-10-01 23:58:00    12
2000-10-02 00:05:00    15
2000-10-02 00:12:00    18
2000-10-02 00:19:00    21
2000-10-02 00:26:00    24
Freq: 7T, dtype: int64
>>> ts.resample('17min').sum()
2000-10-01 23:14:00     0
2000-10-01 23:31:00     9
2000-10-01 23:48:00    21
2000-10-02 00:05:00    54
2000-10-02 00:22:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='epoch').sum()
2000-10-01 23:18:00     0
2000-10-01 23:35:00    18
2000-10-01 23:52:00    27
2000-10-02 00:09:00    39
2000-10-02 00:26:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', origin='2000-01-01').sum()
2000-10-01 23:24:00     3
2000-10-01 23:41:00    15
2000-10-01 23:58:00    45
2000-10-02 00:15:00    45
Freq: 17T, dtype: int64

If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:

>>> ts.resample('17min', origin='start').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64
>>> ts.resample('17min', offset='23h30min').sum()
2000-10-01 23:30:00     9
2000-10-01 23:47:00    21
2000-10-02 00:04:00    54
2000-10-02 00:21:00    24
Freq: 17T, dtype: int64

If you want to take the largest Timestamp as the end of the bins:

>>> ts.resample('17min', origin='end').sum()
2000-10-01 23:35:00     0
2000-10-01 23:52:00    18
2000-10-02 00:09:00    27
2000-10-02 00:26:00    63
Freq: 17T, dtype: int64

In contrast with ‘start_day’, you can use ‘end_day’ to take the ceiling midnight of the largest Timestamp as the end of the bins and drop the bins not containing data:

>>> ts.resample('17min', origin='end_day').sum()
2000-10-01 23:38:00     3
2000-10-01 23:55:00    15
2000-10-02 00:12:00    45
2000-10-02 00:29:00    45
Freq: 17T, dtype: int64
to_timestamp(freq=None, how='start', copy=None)[source]

Cast to DatetimeIndex of Timestamps, at beginning of period.

Parameters:
  • freq (str, default frequency of PeriodIndex) – Desired frequency.

  • how ({'s', 'e', 'start', 'end'}) – Convention for converting period to timestamp; start of period vs. end.

  • copy (bool, default True) – Whether or not to return a copy.

Return type:

Series with DatetimeIndex

Examples

>>> idx = pd.PeriodIndex(['2023', '2024', '2025'], freq='Y')
>>> s1 = pd.Series([1, 2, 3], index=idx)
>>> s1
2023    1
2024    2
2025    3
Freq: A-DEC, dtype: int64

The resulting frequency of the Timestamps is YearBegin.

>>> s1 = s1.to_timestamp()
>>> s1
2023-01-01    1
2024-01-01    2
2025-01-01    3
Freq: AS-JAN, dtype: int64

Using freq, which is the offset that the Timestamps will have:

>>> s2 = pd.Series([1, 2, 3], index=idx)
>>> s2 = s2.to_timestamp(freq='M')
>>> s2
2023-01-31    1
2024-01-31    2
2025-01-31    3
Freq: A-JAN, dtype: int64
to_period(freq=None, copy=None)[source]

Convert Series from DatetimeIndex to PeriodIndex.

Parameters:
  • freq (str, default None) – Frequency associated with the PeriodIndex.

  • copy (bool, default True) – Whether or not to return a copy.

Returns:

Series with index converted to PeriodIndex.

Return type:

Series

Examples

>>> idx = pd.DatetimeIndex(['2023', '2024', '2025'])
>>> s = pd.Series([1, 2, 3], index=idx)
>>> s = s.to_period()
>>> s
2023    1
2024    2
2025    3
Freq: A-DEC, dtype: int64

Viewing the index

>>> s.index
PeriodIndex(['2023', '2024', '2025'], dtype='period[A-DEC]')
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) Series[source]
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
ffill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) Series | None

Synonym for DataFrame.fillna() with method='ffill'.

Returns:

Object with missing values filled or None if inplace=True.

Return type:

Series/DataFrame or None
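
Examples

A minimal illustration (an added sketch, assuming a Series with one missing value; the last valid observation is propagated forward):

>>> ser = pd.Series([1.0, np.nan, 3.0])
>>> ser.ffill()
0    1.0
1    1.0
2    3.0
dtype: float64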

bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[False] = False, limit: None | int = None, downcast: dict | None = None) Series[source]
bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: Literal[True], limit: None | int = None, downcast: dict | None = None) None
bfill(*, axis: None | int | Literal['index', 'columns', 'rows'] = None, inplace: bool = False, limit: None | int = None, downcast: dict | None = None) Series | None

Synonym for DataFrame.fillna() with method='bfill'.

Returns:

Object with missing values filled or None if inplace=True.

Return type:

Series/DataFrame or None
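
Examples

A minimal sketch (the sample values are invented for illustration):

>>> s = pd.Series([np.nan, 1, np.nan, 2])
>>> s.bfill()
0    1.0
1    1.0
2    2.0
3    2.0
dtype: float64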

clip(lower=None, upper=None, *, axis=None, inplace=False, **kwargs)[source]

Trim values at input threshold(s).

Assigns values outside boundary to boundary values. Thresholds can be singular values or array-like, and in the latter case the clipping is performed element-wise along the specified axis.

Parameters:
  • lower (float or array-like, default None) – Minimum threshold value. All values below this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • upper (float or array-like, default None) – Maximum threshold value. All values above this threshold will be set to it. A missing threshold (e.g. NA) will not clip the value.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Align object with lower and upper along the given axis. For Series this parameter is unused and defaults to None.

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • *args, **kwargs – Additional keywords have no effect but might be accepted for compatibility with numpy.

Returns:

Same type as calling object with the values outside the clip boundaries replaced or None if inplace=True.

Return type:

Series or DataFrame or None

See also

Series.clip

Trim values at input threshold in series.

DataFrame.clip

Trim values at input threshold in dataframe.

numpy.clip

Clip (limit) the values in an array.

Examples

>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
   col_0  col_1
0      9     -2
1     -3     -7
2      0      6
3     -1      8
4      5     -5

Clips per column using lower and upper thresholds:

>>> df.clip(-4, 6)
   col_0  col_1
0      6     -2
1     -3     -4
2      0      6
3     -1      6
4      5     -4

Clips using specific lower and upper thresholds per column element:

>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0    2
1   -4
2   -1
3    6
4    3
dtype: int64
>>> df.clip(t, t + 4, axis=0)
   col_0  col_1
0      6      2
1     -3     -4
2      0      3
3      6      8
4      5      3

Clips using specific lower threshold per column element, with missing values:

>>> t = pd.Series([2, -4, np.nan, 6, 3])
>>> t
0    2.0
1   -4.0
2    NaN
3    6.0
4    3.0
dtype: float64
>>> df.clip(t, axis=0)
   col_0  col_1
0      9      2
1     -3     -4
2      0      6
3      6      8
4      5      3
interpolate(method='linear', *, axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)[source]

Fill NaN values using an interpolation method.

Please note that only method='linear' is supported for DataFrame/Series with a MultiIndex.

Parameters:
  • method (str, default 'linear') –

    Interpolation technique to use. One of:

    • 'linear': Ignore the index and treat the values as equally spaced. This is the only method supported on MultiIndexes.

    • 'time': Works on daily and higher resolution data to interpolate over the given length of the interval.

    • 'index', 'values': Use the actual numerical values of the index.

    • 'pad': Fill in NaNs using existing values.

    • 'nearest', 'zero', 'slinear', 'quadratic', 'cubic', 'barycentric', 'polynomial': Passed to scipy.interpolate.interp1d, whereas 'spline' is passed to scipy.interpolate.UnivariateSpline. These methods use the numerical values of the index. Both 'polynomial' and 'spline' require that you also specify an order (int), e.g. df.interpolate(method='polynomial', order=5). Note that the 'slinear' method in pandas refers to the SciPy first-order spline rather than pandas' first-order spline.

    • 'krogh', 'piecewise_polynomial', 'spline', 'pchip', 'akima', 'cubicspline': Wrappers around the SciPy interpolation methods of similar names. See Notes.

    • 'from_derivatives': Refers to scipy.interpolate.BPoly.from_derivatives, which replaces the 'piecewise_polynomial' interpolation method in scipy 0.18.

  • axis ({0 or 'index', 1 or 'columns', None}, default None) – Axis to interpolate along. For Series this parameter is unused and defaults to 0.

  • limit (int, optional) – Maximum number of consecutive NaNs to fill. Must be greater than 0.

  • inplace (bool, default False) – Update the data in place if possible.

  • limit_direction ({'forward', 'backward', 'both'}, optional) –

    Consecutive NaNs will be filled in this direction.

    If limit is specified:
    • If 'method' is 'pad' or 'ffill', 'limit_direction' must be 'forward'.

    • If 'method' is 'backfill' or 'bfill', 'limit_direction' must be 'backward'.

    If 'limit' is not specified:
    • If 'method' is 'backfill' or 'bfill', the default is 'backward'

    • else the default is 'forward'

    Changed in version 1.1.0: raises ValueError if limit_direction is 'forward' or 'both' and method is 'backfill' or 'bfill'; raises ValueError if limit_direction is 'backward' or 'both' and method is 'pad' or 'ffill'.

  • limit_area ({None, 'inside', 'outside'}, default None) –

    If limit is specified, consecutive NaNs will be filled with this restriction.

    • None: No fill restriction.

    • 'inside': Only fill NaNs surrounded by valid values (interpolate).

    • 'outside': Only fill NaNs outside valid values (extrapolate).

  • downcast (optional, 'infer' or None, defaults to None) – Downcast dtypes if possible.

  • **kwargs (optional) – Keyword arguments to pass on to the interpolating function.

Returns:

Returns the same object type as the caller, interpolated at some or all NaN values or None if inplace=True.

Return type:

Series or DataFrame or None

See also

fillna

Fill missing values using different methods.

scipy.interpolate.Akima1DInterpolator

Piecewise cubic polynomials (Akima interpolator).

scipy.interpolate.BPoly.from_derivatives

Piecewise polynomial in the Bernstein basis.

scipy.interpolate.interp1d

Interpolate a 1-D function.

scipy.interpolate.KroghInterpolator

Interpolate polynomial (Krogh interpolator).

scipy.interpolate.PchipInterpolator

PCHIP 1-d monotonic cubic interpolation.

scipy.interpolate.CubicSpline

Cubic spline data interpolator.

Notes

The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective SciPy implementations of similar names. These use the actual numerical values of the index. For more information on their behavior, see the SciPy documentation.

Examples

Filling in NaN in a Series via linear interpolation.

>>> s = pd.Series([0, 1, np.nan, 3])
>>> s
0    0.0
1    1.0
2    NaN
3    3.0
dtype: float64
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64

Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.

>>> s = pd.Series([np.nan, "single_one", np.nan,
...                "fill_two_more", np.nan, np.nan, np.nan,
...                4.71, np.nan])
>>> s
0              NaN
1       single_one
2              NaN
3    fill_two_more
4              NaN
5              NaN
6              NaN
7             4.71
8              NaN
dtype: object
>>> s.interpolate(method='pad', limit=2)
0              NaN
1       single_one
2       single_one
3    fill_two_more
4    fill_two_more
5    fill_two_more
6              NaN
7             4.71
8             4.71
dtype: object

Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods require that you also specify an order (int).

>>> s = pd.Series([0, 2, np.nan, 8])
>>> s.interpolate(method='polynomial', order=2)
0    0.000000
1    2.000000
2    4.666667
3    8.000000
dtype: float64

Fill the DataFrame forward (that is, going down) along each column using linear interpolation.

Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it to use for interpolation.

>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

Using polynomial interpolation.

>>> df['d'].interpolate(method='polynomial', order=2)
0     1.0
1     4.0
2     9.0
3    16.0
Name: d, dtype: float64
where(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → Series[source]
where(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → None
where(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → Series | None

Replace values where the condition is False.

Parameters:
  • cond (bool Series/DataFrame, array-like, or callable) – Where cond is True, keep the original value. Where False, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

  • other (scalar, Series/DataFrame, or callable) – Entries where cond is False are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.

  • level (int, default None) – Alignment level if needed.

Return type:

Same type as caller or None if inplace=True.

See also

DataFrame.mask()

Return an object of same shape as self.

Notes

The where method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is True the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with False.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the where documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object's dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
mask(cond, other=_NoDefault.no_default, *, inplace: Literal[False] = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → Series[source]
mask(cond, other=_NoDefault.no_default, *, inplace: Literal[True], axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → None
mask(cond, other=_NoDefault.no_default, *, inplace: bool = False, axis: int | Literal['index', 'columns', 'rows'] | None = None, level: Hashable = None) → Series | None

Replace values where the condition is True.

Parameters:
  • cond (bool Series/DataFrame, array-like, or callable) – Where cond is False, keep the original value. Where True, replace with corresponding value from other. If cond is callable, it is computed on the Series/DataFrame and should return boolean Series/DataFrame or array. The callable must not change input Series/DataFrame (though pandas doesn’t check it).

  • other (scalar, Series/DataFrame, or callable) – Entries where cond is True are replaced with corresponding value from other. If other is callable, it is computed on the Series/DataFrame and should return scalar or Series/DataFrame. The callable must not change input Series/DataFrame (though pandas doesn’t check it). If not specified, entries will be filled with the corresponding NULL value (np.nan for numpy dtypes, pd.NA for extension dtypes).

  • inplace (bool, default False) – Whether to perform the operation in place on the data.

  • axis (int, default None) – Alignment axis if needed. For Series this parameter is unused and defaults to 0.

  • level (int, default None) – Alignment level if needed.

Return type:

Same type as caller or None if inplace=True.

See also

DataFrame.where()

Return an object of same shape as self.

Notes

The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if cond is False the element is used; otherwise the corresponding element from the DataFrame other is used. If the axis of other does not align with axis of cond Series/DataFrame, the misaligned index positions will be filled with True.

The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).

For further details and examples see the mask documentation in indexing.

The dtype of the object takes precedence. The fill value is cast to the object's dtype, if this can be done losslessly.

Examples

>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
>>> s.mask(s > 0)
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
>>> s = pd.Series(range(5))
>>> t = pd.Series([True, False])
>>> s.where(t, 99)
0     0
1    99
2    99
3    99
4    99
dtype: int64
>>> s.mask(t, 99)
0    99
1     1
2    99
3    99
4    99
dtype: int64
>>> s.where(s > 1, 10)
0    10
1    10
2    2
3    3
4    4
dtype: int64
>>> s.mask(s > 1, 10)
0     0
1     1
2    10
3    10
4    10
dtype: int64
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])
>>> df
   A  B
0  0  1
1  2  3
2  4  5
3  6  7
4  8  9
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9
>>> df.where(m, -df) == np.where(m, df, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
>>> df.where(m, -df) == df.mask(~m, -df)
      A     B
0  True  True
1  True  True
2  True  True
3  True  True
4  True  True
index

The index (axis labels) of the Series.

str

alias of StringMethods

dt

alias of CombinedDatetimelikeProperties

cat

alias of CategoricalAccessor

plot

alias of PlotAccessor

sparse

alias of SparseAccessor

hist(by=None, ax=None, grid=True, xlabelsize=None, xrot=None, ylabelsize=None, yrot=None, figsize=None, bins=10, backend=None, legend=False, **kwargs)

Draw histogram of the input series using matplotlib.

Parameters:
  • by (object, optional) – If passed, then used to form histograms for separate groups.

  • ax (matplotlib axis object) – If not passed, uses gca().

  • grid (bool, default True) – Whether to show axis grid lines.

  • xlabelsize (int, default None) – If specified changes the x-axis label size.

  • xrot (float, default None) – Rotation of x axis labels.

  • ylabelsize (int, default None) – If specified changes the y-axis label size.

  • yrot (float, default None) – Rotation of y axis labels.

  • figsize (tuple, default None) – Figure size in inches by default.

  • bins (int or sequence, default 10) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges, including left edge of first bin and right edge of last bin. In this case, bins is returned unmodified.

  • backend (str, default None) – Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.

  • legend (bool, default False) –

    Whether to show the legend.

    New in version 1.1.0.

  • **kwargs – To be passed to the actual plotting function.

Returns:

A histogram plot.

Return type:

matplotlib.AxesSubplot

See also

matplotlib.axes.Axes.hist

Plot a histogram using matplotlib.
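
Examples

A minimal sketch (the sample data is invented for illustration; the rendered figure is not shown here):

>>> s = pd.Series([1, 2, 2, 3, 3, 3])
>>> ax = s.hist(bins=3)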

class pandas.SparseDtype[source]

Dtype for data stored in SparseArray.

This dtype implements the pandas ExtensionDtype interface.

Parameters:
  • dtype (str, ExtensionDtype, numpy.dtype, type, default numpy.float64) – The dtype of the underlying array storing the non-fill values.

  • fill_value (scalar, optional) –

    The scalar value not stored in the SparseArray. By default, this depends on dtype:

    • float – np.nan

    • int – 0

    • bool – False

    • datetime64 – pd.NaT

    • timedelta64 – pd.NaT

    The default value may be overridden by specifying a fill_value.

property fill_value

The fill value of the array.

Converting the SparseArray to a dense ndarray will fill the array with this value.

Warning

It’s possible to end up with a SparseArray that has fill_value values in sp_values. This can occur, for example, when setting SparseArray.fill_value directly.
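
Examples

A brief sketch of the defaults described above (and of overriding them):

>>> pd.SparseDtype(float).fill_value
nan
>>> pd.SparseDtype(int, fill_value=-1).fill_value
-1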

property kind: str

The sparse kind. Either ‘integer’, or ‘block’.

property type

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

property subtype
property name: str

A string identifying the data type.

Will be used for display in, e.g. Series.dtype

classmethod construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

classmethod construct_from_string(string)[source]

Construct a SparseDtype from a string form.

Parameters:

string (str) –

Can take the following forms:

  • 'int' – SparseDtype[np.int64, 0]

  • 'Sparse' – SparseDtype[np.float64, nan]

  • 'Sparse[int]' – SparseDtype[np.int64, 0]

  • 'Sparse[int, 0]' – SparseDtype[np.int64, 0]

It is not possible to specify non-default fill values with a string. An argument like 'Sparse[int, 1]' will raise a TypeError because the default fill value for integers is 0.

Return type:

SparseDtype

classmethod is_dtype(dtype)[source]

Check if we match ‘dtype’.

Parameters:

dtype (object) – The object to check.

Return type:

bool

Notes

The default implementation is True if

  1. cls.construct_from_string(dtype) is an instance of cls.

  2. dtype is an object and is an instance of cls

  3. dtype has a dtype attribute, and any of the above conditions is true for dtype.dtype.
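
For example (a brief sketch consistent with the rules above):

>>> pd.SparseDtype.is_dtype(pd.SparseDtype(int))
True
>>> pd.SparseDtype.is_dtype('Sparse[int]')
True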

update_dtype(dtype)[source]

Convert the SparseDtype to a new dtype.

This takes care of converting the fill_value.

Parameters:

dtype (Union[str, numpy.dtype, SparseDtype]) –

The new dtype to use.

  • For a SparseDtype, it is simply returned

  • For a NumPy dtype (or str), the current fill value is converted to the new dtype, and a SparseDtype with dtype and the new fill value is returned.

Returns:

A new SparseDtype with the correct dtype and fill value for that dtype.

Return type:

SparseDtype

Raises:

ValueError – When the current fill value cannot be converted to the new dtype (e.g. trying to convert np.nan to an integer dtype).

Examples

>>> pd.SparseDtype(int, 0).update_dtype(float)
Sparse[float64, 0.0]
>>> pd.SparseDtype(int, 1).update_dtype(pd.SparseDtype(float, np.nan))
Sparse[float64, nan]
class pandas.StringDtype[source]

Extension dtype for string data.

Warning

StringDtype is considered experimental. The implementation and parts of the API may change without warning.

Parameters:

storage ({"python", "pyarrow"}, optional) – If not given, the value of pd.options.mode.string_storage.

Examples

>>> pd.StringDtype()
string[python]
>>> pd.StringDtype(storage="pyarrow")
string[pyarrow]
name: str = 'string'
property na_value: NAType

Default NA value to use for this type.

This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value, not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.

property type: type[str]

The scalar type for the array, e.g. int

It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar item, assuming that value is valid (not NA). NA values do not need to be instances of type.

classmethod construct_from_string(string)[source]

Construct a StringDtype from a string.

Parameters:

string (str) –

The type of the name. The storage type will be taken from string. Valid options and their storage types are:

  • 'string' – pd.options.mode.string_storage (default python)

  • 'string[python]' – python

  • 'string[pyarrow]' – pyarrow

Return type:

StringDtype

Raises:

TypeError – If the string is not a valid option.
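
For example (a minimal sketch):

>>> pd.StringDtype.construct_from_string('string[pyarrow]')
string[pyarrow]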

construct_array_type()[source]

Return the array type associated with this dtype.

Return type:

type

class pandas.Timedelta

Represents a duration, the difference between two dates or times.

Timedelta is the pandas equivalent of python’s datetime.timedelta and is interchangeable with it in most cases.

Parameters:
  • value (Timedelta, timedelta, np.timedelta64, str, or int) –

  • unit (str, default 'ns') –

    Denote the unit of the input, if input is an integer.

    Possible values:

    • 'W', 'D', 'T', 'S', 'L', 'U', or 'N'

    • 'days' or 'day'

    • 'hours', 'hour', 'hr', or 'h'

    • 'minutes', 'minute', 'min', or 'm'

    • 'seconds', 'second', or 'sec'

    • 'milliseconds', 'millisecond', 'millis', or 'milli'

    • 'microseconds', 'microsecond', 'micros', or 'micro'

    • 'nanoseconds', 'nanosecond', 'nanos', 'nano', or 'ns'.

  • **kwargs – Available kwargs: {days, seconds, microseconds, milliseconds, minutes, hours, weeks}. Values for construction in compat with datetime.timedelta. Numpy ints and floats will be coerced to python ints and floats.

Notes

The constructor may take either both value and unit, or the kwargs listed above. Exactly one of these forms must be used during initialization.

The .value attribute is always in ns.

If the precision is higher than nanoseconds, the precision of the duration is truncated to nanoseconds.

Examples

Here we initialize Timedelta object with both value and unit

>>> td = pd.Timedelta(1, "d")
>>> td
Timedelta('1 days 00:00:00')

Here we initialize the Timedelta object with kwargs

>>> td2 = pd.Timedelta(days=1)
>>> td2
Timedelta('1 days 00:00:00')

We see that either way we get the same result:
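
>>> td == td2
True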

ceil(freq)

Return a new Timedelta ceiled to this resolution.

Parameters:

freq (str) – Frequency string indicating the ceiling resolution.

floor(freq)

Return a new Timedelta floored to this resolution.

Parameters:

freq (str) – Frequency string indicating the flooring resolution.

round(freq)

Round the Timedelta to the specified resolution.

Parameters:

freq (str) – Frequency string indicating the rounding resolution.

Return type:

a new Timedelta rounded to the given resolution of freq

Raises:

ValueError if the freq cannot be converted
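
Examples

A brief sketch (the value is invented for illustration) showing ceil, floor, and round on one Timedelta:

>>> td = pd.Timedelta('1 days 02:34:56')
>>> td.ceil('H')
Timedelta('1 days 03:00:00')
>>> td.floor('H')
Timedelta('1 days 02:00:00')
>>> td.round('min')
Timedelta('1 days 02:35:00')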

class pandas.TimedeltaIndex[source]

Immutable Index of timedelta64 data.

Represented internally as int64, and scalars are returned as Timedelta objects.

Parameters:
  • data (array-like (1-dimensional), optional) – Optional timedelta-like data to construct index with.

  • unit (str, optional) – The unit of the data (D, h, m, s, ms, us, ns), used when data contains integer or float values.

  • freq (str or pandas offset object, optional) – One of pandas date offset strings or corresponding objects. The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon creation.

  • copy (bool) – Make a copy of input ndarray.

  • name (object) – Name to be stored in the index.

days
seconds
microseconds
nanoseconds
components
inferred_freq
to_pytimedelta()
to_series()
round()
floor()
ceil()
to_frame()
mean()

See also

Index

The base pandas Index type.

Timedelta

Represents a duration between two dates or times.

DatetimeIndex

Index of datetime64 data.

PeriodIndex

Index of Period data.

timedelta_range

Create a fixed-frequency TimedeltaIndex.

Notes

To learn more about the frequency strings, please see this link.
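
Examples

A minimal construction sketch (the values are invented for illustration):

>>> pd.TimedeltaIndex(['1 days', '2 days', '3 days'])
TimedeltaIndex(['1 days', '2 days', '3 days'], dtype='timedelta64[ns]', freq=None)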

get_loc(key)[source]

Get integer location for requested label

Returns:

loc

Return type:

int, slice, or ndarray[int]
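
For example (a brief sketch):

>>> tdi = pd.to_timedelta(['1 days', '2 days', '3 days'])
>>> tdi.get_loc(pd.Timedelta('2 days'))
1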

property inferred_type: str

Return a string of the type inferred from the values.

ceil(*args, **kwargs)

Perform ceil operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to ceil the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • 'infer' will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • 'NaT' will return NaT where there are ambiguous times

    • 'raise' will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time

    • 'NaT' will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError if the freq cannot be converted.

Notes

If the timestamps have a timezone, ceiling will take place relative to the local (“wall”) time and re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.ceil('H')
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 13:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.ceil("H")
0   2018-01-01 12:00:00
1   2018-01-01 12:00:00
2   2018-01-01 13:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 01:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.ceil("H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.ceil("H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
property components

Return a DataFrame of the individual resolution components of the Timedeltas.

The components (days, hours, minutes seconds, milliseconds, microseconds, nanoseconds) are returned as columns in a DataFrame.

Return type:

DataFrame

property days

Number of days for each element.

floor(*args, **kwargs)

Perform floor operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to floor the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • 'infer' will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • 'NaT' will return NaT where there are ambiguous times

    • 'raise' will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time

    • 'NaT' will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError if the freq cannot be converted.

Notes

If the timestamps have a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.floor('H')
DatetimeIndex(['2018-01-01 11:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.floor("H")
0   2018-01-01 11:00:00
1   2018-01-01 12:00:00
2   2018-01-01 12:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
median(*args, **kwargs)
property microseconds

Number of microseconds (>= 0 and less than 1 second) for each element.

property nanoseconds

Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.

round(*args, **kwargs)

Perform round operation on the data to the specified freq.

Parameters:
  • freq (str or Offset) – The frequency level to round the index to. Must be a fixed frequency like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq values.

  • ambiguous ('infer', bool-ndarray, 'NaT', default 'raise') –

    Only relevant for DatetimeIndex:

    • 'infer' will attempt to infer fall dst-transition hours based on order

    • bool-ndarray where True signifies a DST time, False designates a non-DST time (note that this flag is only applicable for ambiguous times)

    • 'NaT' will return NaT where there are ambiguous times

    • 'raise' will raise an AmbiguousTimeError if there are ambiguous times.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time

    • 'NaT' will return NaT where there are nonexistent times

    • timedelta objects will shift nonexistent times by the timedelta

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Returns:

Index of the same type for a DatetimeIndex or TimedeltaIndex, or a Series with the same index for a Series.

Return type:

DatetimeIndex, TimedeltaIndex, or Series

Raises:

ValueError if the freq cannot be converted.

Notes

If the timestamps have a timezone, rounding will take place relative to the local (“wall”) time and re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

DatetimeIndex

>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.round('H')
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)

Series

>>> pd.Series(rng).dt.round("H")
0   2018-01-01 12:00:00
1   2018-01-01 12:00:00
2   2018-01-01 12:00:00
dtype: datetime64[ns]

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> rng_tz = pd.DatetimeIndex(["2021-10-31 03:30:00"], tz="Europe/Amsterdam")
>>> rng_tz.floor("2H", ambiguous=False)
DatetimeIndex(['2021-10-31 02:00:00+01:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
>>> rng_tz.floor("2H", ambiguous=True)
DatetimeIndex(['2021-10-31 02:00:00+02:00'],
              dtype='datetime64[ns, Europe/Amsterdam]', freq=None)
property seconds

Number of seconds (>= 0 and less than 1 day) for each element.

std(*args, **kwargs)
sum(*args, **kwargs)
to_pytimedelta(*args, **kwargs)

Return an ndarray of datetime.timedelta objects.

Return type:

numpy.ndarray

total_seconds(*args, **kwargs)

Return total duration of each element expressed in seconds.

This method is available directly on TimedeltaArray, TimedeltaIndex and on Series containing timedelta values under the .dt namespace.

Returns:

When the calling object is a TimedeltaArray, the return type is ndarray. When the calling object is a TimedeltaIndex, the return type is an Index with a float64 dtype. When the calling object is a Series, the return type is Series of type float64 whose index is the same as the original.

Return type:

ndarray, Index or Series

See also

datetime.timedelta.total_seconds

Standard library version of this method.

TimedeltaIndex.components

Return a DataFrame with components of each Timedelta.

Examples

Series

>>> s = pd.Series(pd.to_timedelta(np.arange(5), unit='d'))
>>> s
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]
>>> s.dt.total_seconds()
0         0.0
1     86400.0
2    172800.0
3    259200.0
4    345600.0
dtype: float64

TimedeltaIndex

>>> idx = pd.to_timedelta(np.arange(5), unit='d')
>>> idx
TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq=None)
>>> idx.total_seconds()
Index([0.0, 86400.0, 172800.0, 259200.0, 345600.0], dtype='float64')
class pandas.Timestamp

Pandas replacement for python datetime.datetime object.

Timestamp is the pandas equivalent of python's datetime.datetime and is interchangeable with it in most cases. It's the type used for the entries that make up a DatetimeIndex, and other timeseries oriented data structures in pandas.

Parameters:
  • ts_input (datetime-like, str, int, float) – Value to be converted to Timestamp.

  • year (int) –

  • month (int) –

  • day (int) –

  • hour (int, optional, default 0) –

  • minute (int, optional, default 0) –

  • second (int, optional, default 0) –

  • microsecond (int, optional, default 0) –

  • tzinfo (datetime.tzinfo, optional, default None) –

  • nanosecond (int, optional, default 0) –

  • tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will have.

  • unit (str) –

    Unit used for conversion if ts_input is of type int or float. The valid values are ‘D’, ‘h’, ‘m’, ‘s’, ‘ms’, ‘us’, and ‘ns’. For example, ‘s’ means seconds and ‘ms’ means milliseconds.

    For float inputs, the result will be stored in nanoseconds, and the unit attribute will be set as 'ns'.

  • fold ({0, 1}, default None, keyword-only) –

    Due to daylight saving time, one wall clock time can occur twice when shifting from summer to winter time; fold describes whether the datetime-like corresponds to the first (0) or the second time (1) the wall clock hits the ambiguous time.

    New in version 1.1.0.

Notes

There are essentially three calling conventions for the constructor. The primary form accepts four parameters. They can be passed by position or keyword.

The other two forms mimic the parameters from datetime.datetime. They can be passed by either position or keyword, but not both mixed together.

Examples

Using the primary calling convention:

This converts a datetime-like string

>>> pd.Timestamp('2017-01-01T12')
Timestamp('2017-01-01 12:00:00')

This converts a float representing a Unix epoch in units of seconds

>>> pd.Timestamp(1513393355.5, unit='s')
Timestamp('2017-12-16 03:02:35.500000')

This converts an int representing a Unix-epoch in units of seconds and for a particular timezone

>>> pd.Timestamp(1513393355, unit='s', tz='US/Pacific')
Timestamp('2017-12-15 19:02:35-0800', tz='US/Pacific')

Using the other two forms that mimic the API for datetime.datetime:

>>> pd.Timestamp(2017, 1, 1, 12)
Timestamp('2017-01-01 12:00:00')
>>> pd.Timestamp(year=2017, month=1, day=1, hour=12)
Timestamp('2017-01-01 12:00:00')
astimezone(tz)

Convert timezone-aware Timestamp to another time zone.

Parameters:

tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding UTC time.

Returns:

converted

Return type:

Timestamp

Raises:

TypeError – If Timestamp is tz-naive.

Examples

Create a timestamp object with UTC timezone:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC')
>>> ts
Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')

Change to Tokyo timezone:

>>> ts.tz_convert(tz='Asia/Tokyo')
Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')

Can also use astimezone:

>>> ts.astimezone(tz='Asia/Tokyo')
Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')

Analogous for pd.NaT:

>>> pd.NaT.tz_convert(tz='Asia/Tokyo')
NaT
ceil(freq, ambiguous='raise', nonexistent='raise')

Return a new Timestamp ceiled to this resolution.

Parameters:
  • freq (str) – Frequency string indicating the ceiling resolution.

  • ambiguous (bool or {'raise', 'NaT'}, default 'raise') –

    The behavior is as follows:

    • bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).

    • 'NaT' will return NaT for an ambiguous time.

    • 'raise' will raise an AmbiguousTimeError for an ambiguous time.

  • nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time.

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time.

    • 'NaT' will return NaT where there are nonexistent times.

    • timedelta objects will shift nonexistent times by the timedelta.

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Raises:

ValueError if the freq cannot be converted.

Notes

If the Timestamp has a timezone, ceiling will take place relative to the local (“wall”) time and re-localized to the same timezone. When ceiling near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

Create a timestamp object:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')

A timestamp can be ceiled using multiple frequency units:

>>> ts.ceil(freq='H') # hour
Timestamp('2020-03-14 16:00:00')
>>> ts.ceil(freq='T') # minute
Timestamp('2020-03-14 15:33:00')
>>> ts.ceil(freq='S') # seconds
Timestamp('2020-03-14 15:32:53')
>>> ts.ceil(freq='U') # microseconds
Timestamp('2020-03-14 15:32:52.192549')

freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.ceil(freq='5T')
Timestamp('2020-03-14 15:35:00')

or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):

>>> ts.ceil(freq='1H30T')
Timestamp('2020-03-14 16:30:00')

Analogous for pd.NaT:

>>> pd.NaT.ceil()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 01:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.ceil("H", ambiguous=False)
Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.ceil("H", ambiguous=True)
Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
classmethod combine(date, time)

Combine date, time into datetime with same date and time fields.

Examples

>>> from datetime import date, time
>>> pd.Timestamp.combine(date(2020, 3, 14), time(15, 30, 15))
Timestamp('2020-03-14 15:30:15')
daysinmonth

Return the number of days in the month.

Return type:

int

Examples

>>> ts = pd.Timestamp(2020, 3, 14)
>>> ts.days_in_month
31
floor(freq, ambiguous='raise', nonexistent='raise')

Return a new Timestamp floored to this resolution.

Parameters:
  • freq (str) – Frequency string indicating the flooring resolution.

  • ambiguous (bool or {'raise', 'NaT'}, default 'raise') –

    The behavior is as follows:

    • bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).

    • 'NaT' will return NaT for an ambiguous time.

    • 'raise' will raise an AmbiguousTimeError for an ambiguous time.

  • nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time.

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time.

    • 'NaT' will return NaT where there are nonexistent times.

    • timedelta objects will shift nonexistent times by the timedelta.

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Raises:

ValueError if the freq cannot be converted.

Notes

If the Timestamp has a timezone, flooring will take place relative to the local (“wall”) time and re-localized to the same timezone. When flooring near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

Create a timestamp object:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')

A timestamp can be floored using multiple frequency units:

>>> ts.floor(freq='H') # hour
Timestamp('2020-03-14 15:00:00')
>>> ts.floor(freq='T') # minute
Timestamp('2020-03-14 15:32:00')
>>> ts.floor(freq='S') # seconds
Timestamp('2020-03-14 15:32:52')
>>> ts.floor(freq='N') # nanoseconds
Timestamp('2020-03-14 15:32:52.192548651')

freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.floor(freq='5T')
Timestamp('2020-03-14 15:30:00')

or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):

>>> ts.floor(freq='1H30T')
Timestamp('2020-03-14 15:00:00')

Analogous for pd.NaT:

>>> pd.NaT.floor()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 03:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.floor("2H", ambiguous=False)
Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.floor("2H", ambiguous=True)
Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
classmethod fromordinal(ordinal, tz=None)

Construct a timestamp from a proleptic Gregorian ordinal.

Parameters:
  • ordinal (int) – Date corresponding to a proleptic Gregorian ordinal.

  • tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for the Timestamp.

Notes

By definition there cannot be any tz info on the ordinal itself.

Examples

>>> pd.Timestamp.fromordinal(737425)
Timestamp('2020-01-01 00:00:00')
classmethod fromtimestamp(ts)

Transform timestamp[, tz] to tz’s local time from POSIX timestamp.

Examples

>>> pd.Timestamp.fromtimestamp(1584199972)  
Timestamp('2020-03-14 15:32:52')

Note that the output may change depending on your local time.

isoweekday()

Return the day of the week represented by the date.

Monday == 1 … Sunday == 7.
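
For example (2020-03-14 fell on a Saturday):

>>> pd.Timestamp(2020, 3, 14).isoweekday()
6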

classmethod now(tz=None)

Return new Timestamp object representing current time local to tz.

Parameters:

tz (str or timezone object, default None) – Timezone to localize to.

Examples

>>> pd.Timestamp.now()  
Timestamp('2020-11-16 22:06:16.378782')

Analogous for pd.NaT:

>>> pd.NaT.now()
NaT
replace(year=None, month=None, day=None, hour=None, minute=None, second=None, microsecond=None, nanosecond=None, tzinfo=<class 'object'>, fold=None)

Implements datetime.replace, handles nanoseconds.

Parameters:
  • year (int, optional) –

  • month (int, optional) –

  • day (int, optional) –

  • hour (int, optional) –

  • minute (int, optional) –

  • second (int, optional) –

  • microsecond (int, optional) –

  • nanosecond (int, optional) –

  • tzinfo (tz-convertible, optional) –

  • fold (int, optional) –

Return type:

Timestamp with fields replaced

Examples

Create a timestamp object:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC')
>>> ts
Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')

Replace year and the hour:

>>> ts.replace(year=1999, hour=10)
Timestamp('1999-03-14 10:32:52.192548651+0000', tz='UTC')

Replace timezone (not a conversion):

>>> import pytz
>>> ts.replace(tzinfo=pytz.timezone('US/Pacific'))
Timestamp('2020-03-14 15:32:52.192548651-0700', tz='US/Pacific')

Analogous for pd.NaT:

>>> pd.NaT.replace(tzinfo=pytz.timezone('US/Pacific'))
NaT
round(freq, ambiguous='raise', nonexistent='raise')

Round the Timestamp to the specified resolution.

Parameters:
  • freq (str) – Frequency string indicating the rounding resolution.

  • ambiguous (bool or {'raise', 'NaT'}, default 'raise') –

    The behavior is as follows:

    • bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).

    • 'NaT' will return NaT for an ambiguous time.

    • 'raise' will raise an AmbiguousTimeError for an ambiguous time.

  • nonexistent ({'raise', 'shift_forward', 'shift_backward', 'NaT', timedelta}, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time.

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time.

    • 'NaT' will return NaT where there are nonexistent times.

    • timedelta objects will shift nonexistent times by the timedelta.

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Return type:

a new Timestamp rounded to the given resolution of freq

Raises:

ValueError if the freq cannot be converted

Notes

If the Timestamp has a timezone, rounding will take place relative to the local (“wall”) time and re-localized to the same timezone. When rounding near daylight savings time, use nonexistent and ambiguous to control the re-localization behavior.

Examples

Create a timestamp object:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')

A timestamp can be rounded using multiple frequency units:

>>> ts.round(freq='H') # hour
Timestamp('2020-03-14 16:00:00')
>>> ts.round(freq='T') # minute
Timestamp('2020-03-14 15:33:00')
>>> ts.round(freq='S') # seconds
Timestamp('2020-03-14 15:32:52')
>>> ts.round(freq='L') # milliseconds
Timestamp('2020-03-14 15:32:52.193000')

freq can also be a multiple of a single unit, like ‘5T’ (i.e. 5 minutes):

>>> ts.round(freq='5T')
Timestamp('2020-03-14 15:35:00')

or a combination of multiple units, like ‘1H30T’ (i.e. 1 hour and 30 minutes):

>>> ts.round(freq='1H30T')
Timestamp('2020-03-14 15:00:00')

Analogous for pd.NaT:

>>> pd.NaT.round()
NaT

When rounding near a daylight savings time transition, use ambiguous or nonexistent to control how the timestamp should be re-localized.

>>> ts_tz = pd.Timestamp("2021-10-31 01:30:00").tz_localize("Europe/Amsterdam")
>>> ts_tz.round("H", ambiguous=False)
Timestamp('2021-10-31 02:00:00+0100', tz='Europe/Amsterdam')
>>> ts_tz.round("H", ambiguous=True)
Timestamp('2021-10-31 02:00:00+0200', tz='Europe/Amsterdam')
strftime(format)

Return a formatted string of the Timestamp.

Parameters:

format (str) – Format string to convert Timestamp to string. See strftime documentation for more information on the format string: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.

Examples

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
>>> ts.strftime('%Y-%m-%d %X')
'2020-03-14 15:32:52'
classmethod strptime(string, format)

Function is not implemented. Use pd.to_datetime().
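
For example, the equivalent with pd.to_datetime() (a minimal sketch):

>>> pd.to_datetime('2020-03-14 15:32:52', format='%Y-%m-%d %H:%M:%S')
Timestamp('2020-03-14 15:32:52')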

to_julian_date()

Convert TimeStamp to a Julian Date.

0 Julian date is noon January 1, 4713 BC.

Return type:

float64

Examples

>>> ts = pd.Timestamp('2020-03-14T15:32:52')
>>> ts.to_julian_date()
2458923.147824074

classmethod today(tz=None)

Return the current time in the local timezone.

This differs from datetime.today() in that it can be localized to a passed timezone.

Parameters:

tz (str or timezone object, default None) – Timezone to localize to.

Examples

>>> pd.Timestamp.today()    
Timestamp('2020-11-16 22:37:39.969883')

Analogous for pd.NaT:

>>> pd.NaT.today()
NaT
property tz

Alias for tzinfo.

Examples

>>> ts = pd.Timestamp(1584226800, unit='s', tz='Europe/Stockholm')
>>> ts.tz
<DstTzInfo 'Europe/Stockholm' CET+1:00:00 STD>
tz_convert(tz)

Convert timezone-aware Timestamp to another time zone.

Parameters:

tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding UTC time.

Returns:

converted

Return type:

Timestamp

Raises:

TypeError – If Timestamp is tz-naive.

Examples

Create a timestamp object with UTC timezone:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651', tz='UTC')
>>> ts
Timestamp('2020-03-14 15:32:52.192548651+0000', tz='UTC')

Change to Tokyo timezone:

>>> ts.tz_convert(tz='Asia/Tokyo')
Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')

Can also use astimezone:

>>> ts.astimezone(tz='Asia/Tokyo')
Timestamp('2020-03-15 00:32:52.192548651+0900', tz='Asia/Tokyo')

Analogous for pd.NaT:

>>> pd.NaT.tz_convert(tz='Asia/Tokyo')
NaT
tz_localize(tz, ambiguous='raise', nonexistent='raise')

Localize the Timestamp to a timezone.

Convert naive Timestamp to local time zone or remove timezone from timezone-aware Timestamp.

Parameters:
  • tz (str, pytz.timezone, dateutil.tz.tzfile or None) – Time zone for time which Timestamp will be converted to. None will remove timezone holding local time.

  • ambiguous (bool, 'NaT', default 'raise') –

    When clocks moved backward due to DST, ambiguous times may arise. For example in Central European Time (UTC+01), when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at 00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter dictates how ambiguous times should be handled.

    The behavior is as follows:

    • bool contains flags to determine if time is dst or not (note that this flag is only applicable for ambiguous fall dst dates).

    • 'NaT' will return NaT for an ambiguous time.

    • 'raise' will raise an AmbiguousTimeError for an ambiguous time.

  • nonexistent ('shift_forward', 'shift_backward', 'NaT', timedelta, default 'raise') –

    A nonexistent time does not exist in a particular timezone where clocks moved forward due to DST.

    The behavior is as follows:

    • 'shift_forward' will shift the nonexistent time forward to the closest existing time.

    • 'shift_backward' will shift the nonexistent time backward to the closest existing time.

    • 'NaT' will return NaT where there are nonexistent times.

    • timedelta objects will shift nonexistent times by the timedelta.

    • 'raise' will raise a NonExistentTimeError if there are nonexistent times.

Returns:

localized

Return type:

Timestamp

Raises:

TypeError – If the Timestamp is tz-aware and tz is not None.

Examples

Create a naive timestamp object:

>>> ts = pd.Timestamp('2020-03-14T15:32:52.192548651')
>>> ts
Timestamp('2020-03-14 15:32:52.192548651')

Add ‘Europe/Stockholm’ as timezone:

>>> ts.tz_localize(tz='Europe/Stockholm')
Timestamp('2020-03-14 15:32:52.192548651+0100', tz='Europe/Stockholm')

Analogous for pd.NaT:

>>> pd.NaT.tz_localize()
NaT
classmethod utcfromtimestamp(ts)

Construct a timezone-aware UTC datetime from a POSIX timestamp.

Notes

Timestamp.utcfromtimestamp behavior differs from datetime.utcfromtimestamp in returning a timezone-aware object.

Examples

>>> pd.Timestamp.utcfromtimestamp(1584199972)
Timestamp('2020-03-14 15:32:52+0000', tz='UTC')
classmethod utcnow()

Return a new Timestamp representing UTC day and time.

Examples

>>> pd.Timestamp.utcnow()   
Timestamp('2020-11-16 22:50:18.092888+0000', tz='UTC')
weekday()

Return the day of the week represented by the date.

Monday == 0 … Sunday == 6.
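
For example (2020-03-14 fell on a Saturday):

>>> pd.Timestamp(2020, 3, 14).weekday()
5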

weekofyear

Return the week number of the year.

Return type:

int

Examples

>>> ts = pd.Timestamp(2020, 3, 14)
>>> ts.week
11
class pandas.UInt16Dtype[source]

An ExtensionDtype for uint16 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of uint16

name: str = 'UInt16'
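
Example (a brief sketch; pd.NA marks the missing entry):

>>> pd.array([1, 2, None], dtype='UInt16')
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: UInt16
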
class pandas.UInt32Dtype[source]

An ExtensionDtype for uint32 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of uint32

name: str = 'UInt32'
class pandas.UInt64Dtype[source]

An ExtensionDtype for uint64 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of uint64

name: str = 'UInt64'
class pandas.UInt8Dtype[source]

An ExtensionDtype for uint8 integer data.

Uses pandas.NA as its missing value, rather than numpy.nan.

type

alias of uint8

name: str = 'UInt8'
pandas.array(data, dtype=None, copy=True)[source]

Create an array.

Parameters:
  • data (Sequence of objects) –

    The scalars inside data should be instances of the scalar type for dtype. It’s expected that data represents a 1-dimensional array of data.

    When data is an Index or Series, the underlying array will be extracted from data.

  • dtype (str, np.dtype, or ExtensionDtype, optional) –

    The dtype to use for the array. This may be a NumPy dtype or an extension type registered with pandas using pandas.api.extensions.register_extension_dtype().

    If not specified, there are two possibilities:

    1. When data is a Series, Index, or ExtensionArray, the dtype will be taken from the data.

    2. Otherwise, pandas will attempt to infer the dtype from the data.

    Note that when data is a NumPy array, data.dtype is not used for inferring the array type. This is because NumPy cannot represent all the types of data that can be held in extension arrays.

    Currently, pandas will infer an extension dtype for sequences of:

    Scalar Type           Array Type
    -------------------   ------------------------------------------------------
    pandas.Interval       pandas.arrays.IntervalArray
    pandas.Period         pandas.arrays.PeriodArray
    datetime.datetime     pandas.arrays.DatetimeArray
    datetime.timedelta    pandas.arrays.TimedeltaArray
    int                   pandas.arrays.IntegerArray
    float                 pandas.arrays.FloatingArray
    str                   pandas.arrays.StringArray or pandas.arrays.ArrowStringArray
    bool                  pandas.arrays.BooleanArray

    The ExtensionArray created when the scalar type is str is determined by pd.options.mode.string_storage if the dtype is not explicitly given.

    For all other cases, NumPy’s usual inference rules will be used.

    Changed in version 1.2.0: Pandas now also infers nullable-floating dtype for float-like input data

  • copy (bool, default True) – Whether to copy the data, even if not necessary. Depending on the type of data, creating the new array may require copying data, even if copy=False.

Returns:

The newly created array.

Return type:

ExtensionArray

Raises:

ValueError – When data is not 1-dimensional.

See also

numpy.array

Construct a NumPy array.

Series

Construct a pandas Series.

Index

Construct a pandas Index.

arrays.PandasArray

ExtensionArray wrapping a NumPy array.

Series.array

Extract the array stored within a Series.

Notes

Omitting the dtype argument means pandas will attempt to infer the best array type from the values in the data. As new array types are added by pandas and 3rd party libraries, the “best” array type may change. We recommend specifying dtype to ensure that

  1. the correct array type for the data is returned

  2. the returned array type doesn’t change as new extension types are added by pandas and third-party libraries

Additionally, if the underlying memory representation of the returned array matters, we recommend specifying the dtype as a concrete object rather than a string alias or allowing it to be inferred. For example, a future version of pandas or a 3rd-party library may include a dedicated ExtensionArray for string data. In this event, the following would no longer return an arrays.PandasArray backed by a NumPy array.

>>> pd.array(['a', 'b'], dtype=str)
<PandasArray>
['a', 'b']
Length: 2, dtype: str32

This would instead return the new ExtensionArray dedicated for string data. If you really need the new array to be backed by a NumPy array, specify that in the dtype.

>>> pd.array(['a', 'b'], dtype=np.dtype("<U1"))
<PandasArray>
['a', 'b']
Length: 2, dtype: str32

Finally, Pandas has arrays that mostly overlap with NumPy:

  • arrays.DatetimeArray

  • arrays.TimedeltaArray

When data with a datetime64[ns] or timedelta64[ns] dtype is passed, pandas will always return a DatetimeArray or TimedeltaArray rather than a PandasArray. This is for symmetry with the case of timezone-aware data, which NumPy does not natively support.

>>> pd.array(['2015', '2016'], dtype='datetime64[ns]')
<DatetimeArray>
['2015-01-01 00:00:00', '2016-01-01 00:00:00']
Length: 2, dtype: datetime64[ns]
>>> pd.array(["1H", "2H"], dtype='timedelta64[ns]')
<TimedeltaArray>
['0 days 01:00:00', '0 days 02:00:00']
Length: 2, dtype: timedelta64[ns]

Examples

If a dtype is not specified, pandas will infer the best dtype from the values. See the description of dtype above for the types pandas infers.

>>> pd.array([1, 2])
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64
>>> pd.array([1, 2, np.nan])
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
>>> pd.array([1.1, 2.2])
<FloatingArray>
[1.1, 2.2]
Length: 2, dtype: Float64
>>> pd.array(["a", None, "c"])
<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string
>>> with pd.option_context("string_storage", "pyarrow"):
...     arr = pd.array(["a", None, "c"])
...
>>> arr
<ArrowStringArray>
['a', <NA>, 'c']
Length: 3, dtype: string
>>> pd.array([pd.Period('2000', freq="D"), pd.Period("2000", freq="D")])
<PeriodArray>
['2000-01-01', '2000-01-01']
Length: 2, dtype: period[D]

You can use the string alias for dtype

>>> pd.array(['a', 'b', 'a'], dtype='category')
['a', 'b', 'a']
Categories (2, object): ['a', 'b']

Or specify the actual dtype

>>> pd.array(['a', 'b', 'a'],
...          dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
['a', 'b', 'a']
Categories (3, object): ['a' < 'b' < 'c']

If pandas does not infer a dedicated extension type, an arrays.PandasArray is returned.

>>> pd.array([1 + 1j, 3 + 2j])
<PandasArray>
[(1+1j), (3+2j)]
Length: 2, dtype: complex128

As mentioned in the “Notes” section, new extension types may be added in the future (by pandas or 3rd party libraries), causing the return value to no longer be an arrays.PandasArray. Specify the dtype as a NumPy dtype if you need to ensure there’s no future change in behavior.

>>> pd.array([1, 2], dtype=np.dtype("int32"))
<PandasArray>
[1, 2]
Length: 2, dtype: int32

data must be 1-dimensional. A ValueError is raised when the input has the wrong dimensionality.

>>> pd.array(1)
Traceback (most recent call last):
  ...
ValueError: Cannot pass scalar '1' to 'pandas.array'.
pandas.bdate_range(start=None, end=None, periods=None, freq='B', tz=None, normalize=True, name=None, weekmask=None, holidays=None, inclusive='both', **kwargs)[source]

Return a fixed frequency DatetimeIndex with business day as the default.

Parameters:
  • start (str or datetime-like, default None) – Left bound for generating dates.

  • end (str or datetime-like, default None) – Right bound for generating dates.

  • periods (int, default None) – Number of periods to generate.

  • freq (str, Timedelta, datetime.timedelta, or DateOffset, default 'B') – Frequency strings can have multiples, e.g. ‘5H’. The default is business daily (‘B’).

  • tz (str or None) – Time zone name for returning localized DatetimeIndex, for example Asia/Beijing.

  • normalize (bool, default True) – Normalize start/end dates to midnight before generating date range.

  • name (str, default None) – Name of the resulting DatetimeIndex.

  • weekmask (str or None, default None) – Weekmask of valid business days, passed to numpy.busdaycalendar, only used when custom frequency strings are passed. The default value None is equivalent to ‘Mon Tue Wed Thu Fri’.

  • holidays (list-like or None, default None) – Dates to exclude from the set of valid business days, passed to numpy.busdaycalendar, only used when custom frequency strings are passed.

  • inclusive ({"both", "neither", "left", "right"}, default "both") –

    Include boundaries; Whether to set each bound as closed or open.

    New in version 1.4.0.

  • **kwargs – For compatibility. Has no effect on the result.

Return type:

DatetimeIndex

Notes

Of the four parameters: start, end, periods, and freq, exactly three must be specified. Specifying freq is a requirement for bdate_range. Use date_range if specifying freq is not desired.

To learn more about the frequency strings, please see this link.

Examples

Note how the two weekend days are skipped in the result.

>>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')
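
weekmask and holidays only take effect with custom business-day frequency strings such as 'C'. A sketch, reusing the dates above (2018-01-01 was a Monday):

>>> pd.bdate_range(start='1/1/2018', end='1/08/2018', freq='C',
...                weekmask='Mon Wed Fri')
DatetimeIndex(['2018-01-01', '2018-01-03', '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='C')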
pandas.concat(objs, *, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=None)[source]

Concatenate pandas objects along a particular axis.

Allows optional set logic along the other axes.

Can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same (or overlapping) on the passed axis number.

Parameters:
  • objs (a sequence or mapping of Series or DataFrame objects) – If a mapping is passed, the sorted keys will be used as the keys argument, unless keys is passed explicitly, in which case the values will be selected (see below). Any None objects will be dropped silently unless they are all None in which case a ValueError will be raised.

  • axis ({0/'index', 1/'columns'}, default 0) – The axis to concatenate along.

  • join ({'inner', 'outer'}, default 'outer') – How to handle indexes on other axis (or axes).

  • ignore_index (bool, default False) – If True, do not use the index values along the concatenation axis. The resulting axis will be labeled 0, …, n - 1. This is useful if you are concatenating objects where the concatenation axis does not have meaningful indexing information. Note the index values on the other axes are still respected in the join.

  • keys (sequence, default None) – If multiple levels passed, should contain tuples. Construct hierarchical index using the passed keys as the outermost level.

  • levels (list of sequences, default None) – Specific levels (unique values) to use for constructing a MultiIndex. Otherwise they will be inferred from the keys.

  • names (list, default None) – Names for the levels in the resulting hierarchical index.

  • verify_integrity (bool, default False) – Check whether the new concatenated axis contains duplicates. This can be very expensive relative to the actual data concatenation.

  • sort (bool, default False) – Sort non-concatenation axis if it is not already aligned.

  • copy (bool, default True) – If False, do not copy data unnecessarily.

Returns:

When concatenating all Series along the index (axis=0), a Series is returned. When objs contains at least one DataFrame, a DataFrame is returned. When concatenating along the columns (axis=1), a DataFrame is returned.

Return type:

object, type of objs

See also

DataFrame.join

Join DataFrames using indexes.

DataFrame.merge

Merge DataFrames by indexes or columns.

Notes

The keys, levels, and names arguments are all optional.

A walkthrough of how this method fits in with other tools for combining pandas objects can be found here.

It is not recommended to build DataFrames by adding single rows in a for loop. Build a list of rows and make a DataFrame in a single concat.

Examples

Combine two Series.

>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2])
0    a
1    b
0    c
1    d
dtype: object

Clear the existing index and reset it in the result by setting the ignore_index option to True.

>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object

Add a hierarchical index at the outermost level of the data with the keys option.

>>> pd.concat([s1, s2], keys=['s1', 's2'])
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

Label the index keys you create with the names option.

>>> pd.concat([s1, s2], keys=['s1', 's2'],
...           names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object

Combine two DataFrame objects with identical columns.

>>> df1 = pd.DataFrame([['a', 1], ['b', 2]],
...                    columns=['letter', 'number'])
>>> df1
  letter  number
0      a       1
1      b       2
>>> df2 = pd.DataFrame([['c', 3], ['d', 4]],
...                    columns=['letter', 'number'])
>>> df2
  letter  number
0      c       3
1      d       4
>>> pd.concat([df1, df2])
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersection will be filled with NaN values.

>>> df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
...                    columns=['letter', 'number', 'animal'])
>>> df3
  letter  number animal
0      c       3    cat
1      d       4    dog
>>> pd.concat([df1, df3], sort=False)
  letter  number animal
0      a       1    NaN
1      b       2    NaN
0      c       3    cat
1      d       4    dog

Combine DataFrame objects with overlapping columns and return only those that are shared by passing inner to the join keyword argument.

>>> pd.concat([df1, df3], join="inner")
  letter  number
0      a       1
1      b       2
0      c       3
1      d       4

Combine DataFrame objects horizontally along the x axis by passing in axis=1.

>>> df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
...                    columns=['animal', 'name'])
>>> pd.concat([df1, df4], axis=1)
  letter  number  animal    name
0      a       1    bird   polly
1      b       2  monkey  george

Prevent the result from including duplicate index values with the verify_integrity option.

>>> df5 = pd.DataFrame([1], index=['a'])
>>> df5
   0
a  1
>>> df6 = pd.DataFrame([2], index=['a'])
>>> df6
   0
a  2
>>> pd.concat([df5, df6], verify_integrity=True)
Traceback (most recent call last):
    ...
ValueError: Indexes have overlapping values: ['a']

Append a single row to the end of a DataFrame object.

>>> df7 = pd.DataFrame({'a': 1, 'b': 2}, index=[0])
>>> df7
    a   b
0   1   2
>>> new_row = pd.Series({'a': 3, 'b': 4})
>>> new_row
a    3
b    4
dtype: int64
>>> pd.concat([df7, new_row.to_frame().T], ignore_index=True)
    a   b
0   1   2
1   3   4
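
Passing a mapping rather than a sequence uses its keys as the keys argument (a sketch reusing s1 and s2 from above):

>>> pd.concat({'s1': s1, 's2': s2})
s1  0    a
    1    b
s2  0    c
    1    d
dtype: object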
pandas.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', dropna=True, normalize=False)[source]

Compute a simple cross tabulation of two (or more) factors.

By default, computes a frequency table of the factors unless an array of values and an aggregation function are passed.

Parameters:
  • index (array-like, Series, or list of arrays/Series) – Values to group by in the rows.

  • columns (array-like, Series, or list of arrays/Series) – Values to group by in the columns.

  • values (array-like, optional) – Array of values to aggregate according to the factors. Requires aggfunc be specified.

  • rownames (sequence, default None) – If passed, must match number of row arrays passed.

  • colnames (sequence, default None) – If passed, must match number of column arrays passed.

  • aggfunc (function, optional) – If specified, requires values be specified as well.

  • margins (bool, default False) – Add row/column margins (subtotals).

  • margins_name (str, default 'All') – Name of the row/column that will contain the totals when margins is True.

  • dropna (bool, default True) – Do not include columns whose entries are all NaN.

  • normalize (bool, {'all', 'index', 'columns'}, or {0,1}, default False) –

    Normalize by dividing all values by the sum of values.

    • If passed ‘all’ or True, will normalize over all values.

    • If passed ‘index’ will normalize over each row.

    • If passed ‘columns’ will normalize over each column.

    • If margins is True, will also normalize margin values.

Returns:

Cross tabulation of the data.

Return type:

DataFrame

See also

DataFrame.pivot

Reshape data based on column values.

pivot_table

Create a pivot table as a DataFrame.

Notes

Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are specified.

Any input passed containing Categorical data will have all of its categories included in the cross-tabulation, even if the actual data does not contain any instances of a particular category.

In the event that there aren’t overlapping indexes, an empty DataFrame will be returned.

Reference the user guide for more examples.

Examples

>>> a = np.array(["foo", "foo", "foo", "foo", "bar", "bar",
...               "bar", "bar", "foo", "foo", "foo"], dtype=object)
>>> b = np.array(["one", "one", "one", "two", "one", "one",
...               "one", "two", "two", "two", "one"], dtype=object)
>>> c = np.array(["dull", "dull", "shiny", "dull", "dull", "shiny",
...               "shiny", "dull", "shiny", "shiny", "shiny"],
...              dtype=object)
>>> pd.crosstab(a, [b, c], rownames=['a'], colnames=['b', 'c'])
b   one        two
c   dull shiny dull shiny
a
bar    1     2    1     0
foo    2     2    1     2

Here ‘c’ and ‘f’ are not represented in the data and will not be shown in the output because dropna is True by default. Set dropna=False to preserve categories with no data.

>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar)
col_0  d  e
row_0
a      1  0
b      0  1
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0
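
A sketch of the normalize option, reusing a and b from the first example (the fractions are shown with the default float display, which may vary slightly by version):

>>> pd.crosstab(a, b, normalize='index')
col_0       one       two
row_0
bar    0.750000  0.250000
foo    0.571429  0.428571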
pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)[source]

Bin values into discrete intervals.

Use cut when you need to segment and sort data values into bins. This function is also useful for going from a continuous variable to a categorical variable. For example, cut could convert ages to groups of age ranges. Supports binning into an equal number of bins, or a pre-specified array of bins.

Parameters:
  • x (array-like) – The input array to be binned. Must be 1-dimensional.

  • bins (int, sequence of scalars, or IntervalIndex) –

    The criteria to bin by.

    • int : Defines the number of equal-width bins in the range of x. The range of x is extended by .1% on each side to include the minimum and maximum values of x.

    • sequence of scalars : Defines the bin edges allowing for non-uniform width. No extension of the range of x is done.

    • IntervalIndex : Defines the exact bins to be used. Note that IntervalIndex for bins must be non-overlapping.

  • right (bool, default True) – Indicates whether bins includes the rightmost edge or not. If right == True (the default), then the bins [1, 2, 3, 4] indicate (1,2], (2,3], (3,4]. This argument is ignored when bins is an IntervalIndex.

  • labels (array or False, default None) – Specifies the labels for the returned bins. Must be the same length as the resulting bins. If False, returns only integer indicators of the bins. This affects the type of the output container (see below). This argument is ignored when bins is an IntervalIndex. If True, raises an error. When ordered=False, labels must be provided.

  • retbins (bool, default False) – Whether to return the bins or not. Useful when bins is provided as a scalar.

  • precision (int, default 3) – The precision at which to store and display the bins labels.

  • include_lowest (bool, default False) – Whether the first interval should be left-inclusive or not.

  • duplicates ({default 'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques.

  • ordered (bool, default True) –

    Whether the labels are ordered or not. Applies to returned types Categorical and Series (with Categorical dtype). If True, the resulting categorical will be ordered. If False, the resulting categorical will be unordered (labels must be provided).

    New in version 1.1.0.

Returns:

  • out (Categorical, Series, or ndarray) – An array-like object representing the respective bin for each value of x. The type depends on the value of labels.

    • None (default) : returns a Series for Series x or a Categorical for all other inputs. The values stored within are Interval dtype.

    • sequence of scalars : returns a Series for Series x or a Categorical for all other inputs. The values stored within are whatever the type in the sequence is.

    • False : returns an ndarray of integers.

  • bins (numpy.ndarray or IntervalIndex.) – The computed or specified bins. Only returned when retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If set duplicates=drop, bins will drop non-unique bin. For an IntervalIndex bins, this is equal to bins.

See also

qcut

Discretize variable into equal-sized buckets based on rank or based on sample quantiles.

Categorical

Array type for storing data that come from a fixed set of values.

Series

One-dimensional array with axis labels (including time series).

IntervalIndex

Immutable Index implementing an ordered, sliceable set.

Notes

Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or Categorical object.

Reference the user guide for more examples.

Examples

Discretize into three equal-sized bins.

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)
... 
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, retbins=True)
... 
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64, right]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

Discover the same bins, but assign them specific labels. Notice that the returned Categorical’s categories are labels and that it is ordered.

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]),
...        3, labels=["bad", "medium", "good"])
['bad', 'good', 'medium', 'medium', 'good', 'bad']
Categories (3, object): ['bad' < 'medium' < 'good']

ordered=False will result in unordered categories when labels are passed. This parameter can be used to allow non-unique labels:

>>> pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3,
...        labels=["B", "A", "B"], ordered=False)
['B', 'B', 'A', 'A', 'B', 'B']
Categories (2, object): ['A', 'B']

labels=False returns only the integer indicators of the bins.

>>> pd.cut([0, 1, 1, 2], bins=4, labels=False)
array([0, 1, 1, 3])

Passing a Series as an input returns a Series with categorical dtype:

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... 
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64, right]): [(1.992, 4.667] < (4.667, ...

Passing a Series as input with labels=False returns a Series of integer bin codes; each value is mapped numerically to its interval based on bins.

>>> s = pd.Series(np.array([2, 4, 6, 8, 10]),
...               index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, [0, 2, 4, 6, 8, 10], labels=False, retbins=True, right=False)
... 
(a    1.0
 b    2.0
 c    3.0
 d    4.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6,  8, 10]))

Use the duplicates='drop' option when the bin edges are not unique:

>>> pd.cut(s, [0, 2, 4, 6, 10, 10], labels=False, retbins=True,
...        right=False, duplicates='drop')
... 
(a    1.0
 b    2.0
 c    3.0
 d    3.0
 e    NaN
 dtype: float64,
 array([ 0,  2,  4,  6, 10]))

Passing an IntervalIndex for bins results in those categories exactly. Notice that values not covered by the IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between two bins.

>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[int64, right]): [(0, 1] < (2, 3] < (4, 5]]
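
include_lowest makes the first interval left-inclusive when explicit edges are given. A sketch (the first left edge is displayed slightly below 0 because of the internal precision adjustment):

>>> pd.cut([0, 1, 2], bins=[0, 1, 2], include_lowest=True)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0]]
Categories (2, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0]]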
pandas.date_range(start=None, end=None, periods=None, freq=None, tz=None, normalize=False, name=None, inclusive='both', *, unit=None, **kwargs)[source]

Return a fixed frequency DatetimeIndex.

Returns the range of equally spaced time points (where the difference between any two adjacent points is specified by the given frequency) such that they all satisfy start <[=] x <[=] end, where the first one and the last one are, respectively, the first and last time points in that range that fall on the boundary of freq (if given as a frequency string) or that are valid for freq (if given as a pandas.tseries.offsets.DateOffset). If exactly one of start, end, or freq is not specified, the missing parameter can be computed from periods, the number of timesteps in the range. See the note below.

Parameters:
  • start (str or datetime-like, optional) – Left bound for generating dates.

  • end (str or datetime-like, optional) – Right bound for generating dates.

  • periods (int, optional) – Number of periods to generate.

  • freq (str, datetime.timedelta, or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’. See here for a list of frequency aliases.

  • tz (str or tzinfo, optional) – Time zone name for returning localized DatetimeIndex, for example ‘Asia/Hong_Kong’. By default, the resulting DatetimeIndex is timezone-naive unless timezone-aware datetime-likes are passed.

  • normalize (bool, default False) – Normalize start/end dates to midnight before generating date range.

  • name (str, default None) – Name of the resulting DatetimeIndex.

  • inclusive ({"both", "neither", "left", "right"}, default "both") –

    Include boundaries; Whether to set each bound as closed or open.

    New in version 1.4.0.

  • unit (str, default None) –

    Specify the desired resolution of the result.

    New in version 2.0.0.

  • **kwargs – For compatibility. Has no effect on the result.

Return type:

DatetimeIndex

See also

DatetimeIndex

An immutable container for datetimes.

timedelta_range

Return a fixed frequency TimedeltaIndex.

period_range

Return a fixed frequency PeriodIndex.

interval_range

Return a fixed frequency IntervalIndex.

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

Examples

Specifying the values

The next four examples generate the same DatetimeIndex, but vary the combination of start, end and periods.

Specify start and end, with the default daily frequency.

>>> pd.date_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Specify timezone-aware start and end, with the default daily frequency.

>>> pd.date_range(
...     start=pd.to_datetime("1/1/2018").tz_localize("Europe/Berlin"),
...     end=pd.to_datetime("1/08/2018").tz_localize("Europe/Berlin"),
... )
DatetimeIndex(['2018-01-01 00:00:00+01:00', '2018-01-02 00:00:00+01:00',
               '2018-01-03 00:00:00+01:00', '2018-01-04 00:00:00+01:00',
               '2018-01-05 00:00:00+01:00', '2018-01-06 00:00:00+01:00',
               '2018-01-07 00:00:00+01:00', '2018-01-08 00:00:00+01:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='D')

Specify start and periods, the number of periods (days).

>>> pd.date_range(start='1/1/2018', periods=8)
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-06', '2018-01-07', '2018-01-08'],
              dtype='datetime64[ns]', freq='D')

Specify end and periods, the number of periods (days).

>>> pd.date_range(end='1/1/2018', periods=8)
DatetimeIndex(['2017-12-25', '2017-12-26', '2017-12-27', '2017-12-28',
               '2017-12-29', '2017-12-30', '2017-12-31', '2018-01-01'],
              dtype='datetime64[ns]', freq='D')

Specify start, end, and periods; the frequency is generated automatically (linearly spaced).

>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
               '2018-04-27 00:00:00'],
              dtype='datetime64[ns]', freq=None)

Other Parameters

Change the freq (frequency) to 'M' (month-end frequency).

>>> pd.date_range(start='1/1/2018', periods=5, freq='M')
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31', '2018-04-30',
               '2018-05-31'],
              dtype='datetime64[ns]', freq='M')

Multiples are allowed

>>> pd.date_range(start='1/1/2018', periods=5, freq='3M')
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
               '2019-01-31'],
              dtype='datetime64[ns]', freq='3M')

freq can also be specified as an Offset object.

>>> pd.date_range(start='1/1/2018', periods=5, freq=pd.offsets.MonthEnd(3))
DatetimeIndex(['2018-01-31', '2018-04-30', '2018-07-31', '2018-10-31',
               '2019-01-31'],
              dtype='datetime64[ns]', freq='3M')

Specify tz to set the timezone.

>>> pd.date_range(start='1/1/2018', periods=5, tz='Asia/Tokyo')
DatetimeIndex(['2018-01-01 00:00:00+09:00', '2018-01-02 00:00:00+09:00',
               '2018-01-03 00:00:00+09:00', '2018-01-04 00:00:00+09:00',
               '2018-01-05 00:00:00+09:00'],
              dtype='datetime64[ns, Asia/Tokyo]', freq='D')

inclusive controls whether to include start and end that are on the boundary. The default, “both”, includes boundary points on either end.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive="both")
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04'],
              dtype='datetime64[ns]', freq='D')

Use inclusive='left' to exclude end if it falls on the boundary.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive='left')
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03'],
              dtype='datetime64[ns]', freq='D')

Use inclusive='right' to exclude start if it falls on the boundary, and similarly inclusive='neither' will exclude both start and end.

>>> pd.date_range(start='2017-01-01', end='2017-01-04', inclusive='right')
DatetimeIndex(['2017-01-02', '2017-01-03', '2017-01-04'],
              dtype='datetime64[ns]', freq='D')

Specify a unit

>>> pd.date_range(start="2017-01-01", periods=10, freq="100AS", unit="s")
DatetimeIndex(['2017-01-01', '2117-01-01', '2217-01-01', '2317-01-01',
               '2417-01-01', '2517-01-01', '2617-01-01', '2717-01-01',
               '2817-01-01', '2917-01-01'],
              dtype='datetime64[s]', freq='100AS-JAN')
pandas.eval(expr, parser='pandas', engine=None, local_dict=None, global_dict=None, resolvers=(), level=0, target=None, inplace=False)[source]

Evaluate a Python expression as a string using various backends.

The following arithmetic operations are supported: +, -, *, /, **, %, // (python engine only) along with the following boolean operations: | (or), & (and), and ~ (not). Additionally, the 'pandas' parser allows the use of and, or, and not with the same semantics as the corresponding bitwise operators. Series and DataFrame objects are supported and behave as they would with plain ol’ Python evaluation.

Parameters:
  • expr (str) – The expression to evaluate. This string cannot contain any Python statements, only Python expressions.

  • parser ({'pandas', 'python'}, default 'pandas') – The parser to use to construct the syntax tree from the expression. The default of 'pandas' parses code slightly different than standard Python. Alternatively, you can parse an expression using the 'python' parser to retain strict Python semantics. See the enhancing performance documentation for more details.

  • engine ({'python', 'numexpr'}, default 'numexpr') –

    The engine used to evaluate the expression. Supported engines are

    • None : tries to use numexpr, falls back to python

    • 'numexpr' : This default engine evaluates pandas objects using numexpr for large speed ups in complex expressions with large frames.

    • 'python' : Performs operations as if you had eval’d in top level python. This engine is generally not that useful.

    More backends may be available in the future.

  • local_dict (dict or None, optional) – A dictionary of local variables, taken from locals() by default.

  • global_dict (dict or None, optional) – A dictionary of global variables, taken from globals() by default.

  • resolvers (list of dict-like or None, optional) – A list of objects implementing the __getitem__ special method that you can use to inject an additional collection of namespaces to use for variable lookup. For example, this is used in the query() method to inject the DataFrame.index and DataFrame.columns variables that refer to their respective DataFrame instance attributes.

  • level (int, optional) – The number of prior stack frames to traverse and add to the current scope. Most users will not need to change this parameter.

  • target (object, optional, default None) – This is the target object for assignment. It is used when there is variable assignment in the expression. If so, then target must support item assignment with string keys, and if a copy is being returned, it must also support .copy().

  • inplace (bool, default False) – If target is provided, and the expression mutates target, whether to modify target inplace. Otherwise, return a copy of target with the mutation.

Returns:

The completion value of evaluating the given code or None if inplace=True.

Return type:

ndarray, numeric scalar, DataFrame, Series, or None

Raises:

ValueError – There are many instances where such an error can be raised:

  • target=None, but the expression is multiline.

  • The expression is multiline, but not all of the lines have item assignment. An example of such an arrangement is this: a = b + 1 followed by a + 2. There are expressions on different lines, making it multiline, but the last line has no variable assigned to the output of a + 2.

  • inplace=True, but the expression is missing item assignment.

  • Item assignment is provided, but the target does not support string item assignment.

  • Item assignment is provided and inplace=False, but the target does not support the .copy() method.

See also

DataFrame.query

Evaluates a boolean expression to query the columns of a frame.

DataFrame.eval

Evaluate a string describing operations on DataFrame columns.

Notes

The dtype of any objects involved in an arithmetic % operation are recursively cast to float64.

See the enhancing performance documentation for more details.

Examples

>>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
>>> df
  animal  age
0    dog   10
1    pig   20

We can add a new column using pd.eval:

>>> pd.eval("double_age = df.age * 2", target=df)
  animal  age  double_age
0    dog   10          20
1    pig   20          40
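
An expression without assignment simply returns its value. A sketch reusing df from above (engine='python' is chosen here so the example does not depend on numexpr being installed):

>>> pd.eval("df.age * 2", engine='python')
0    20
1    40
Name: age, dtype: int64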
pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)[source]

Encode the object as an enumerated type or categorical variable.

This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available as both a top-level function pandas.factorize(), and as a method Series.factorize() and Index.factorize().

Parameters:
  • values (sequence) – A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization.

  • sort (bool, default False) – Sort uniques and shuffle codes to maintain the relationship.

  • use_na_sentinel (bool, default True) –

    If True, the sentinel -1 will be used for NaN values. If False, NaN values will be encoded as non-negative integers and will not drop the NaN from the uniques of the values.

    New in version 1.5.0.

  • size_hint (int, optional) – Hint to the hashtable sizer.

Returns:

  • codes (ndarray) – An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.

  • uniques (ndarray, Index, or Categorical) – The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.

    Note

    Even if there’s a missing value in values, uniques will not contain an entry for it.

Return type:

tuple[np.ndarray, np.ndarray | Index]

See also

cut

Discretize continuous-valued array.

unique

Find the unique value in an array.

Notes

Reference the user guide for more examples.

Examples

These examples all show factorize as a top-level method like pd.factorize(values). The results are identical for methods like Series.factorize().

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1 and missing values are not included in uniques.

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']

Notice that 'b' is in uniques.categories, despite not being present in cat.values.

For all other pandas objects, an Index of the appropriate type is returned.

>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')

If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting use_na_sentinel=False.

>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: use_na_sentinel=True
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])
>>> codes, uniques = pd.factorize(values, use_na_sentinel=False)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
pandas.get_dummies(data, prefix=None, prefix_sep='_', dummy_na=False, columns=None, sparse=False, drop_first=False, dtype=None)[source]

Convert categorical variable into dummy/indicator variables.

Each variable is converted into as many 0/1 variables as there are different values. Columns in the output are each named after a value; if the input is a DataFrame, the name of the original variable is prepended to the value.

Parameters:
  • data (array-like, Series, or DataFrame) – Data of which to get dummy indicators.

  • prefix (str, list of str, or dict of str, default None) – String to prepend to the dummy column names. Pass a list with length equal to the number of columns when calling get_dummies on a DataFrame. Alternatively, prefix can be a dictionary mapping column names to prefixes.

  • prefix_sep (str, default '_') – If appending prefix, separator/delimiter to use. Or pass a list or dictionary as with prefix.

  • dummy_na (bool, default False) – Add a column to indicate NaNs, if False NaNs are ignored.

  • columns (list-like, default None) – Column names in the DataFrame to be encoded. If columns is None then all the columns with object, string, or category dtype will be converted.

  • sparse (bool, default False) – Whether the dummy-encoded columns should be backed by a SparseArray (True) or a regular NumPy array (False).

  • drop_first (bool, default False) – Whether to get k-1 dummies out of k categorical levels by removing the first level.

  • dtype (dtype, default bool) – Data type for new columns. Only a single dtype is allowed.

Returns:

Dummy-coded data. If data contains other columns than the dummy-coded one(s), these will be prepended, unaltered, to the result.

Return type:

DataFrame

See also

Series.str.get_dummies

Convert Series of strings to dummy codes.

from_dummies()

Convert dummy codes to categorical DataFrame.

Notes

Reference the user guide for more examples.

Examples

>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
       a      b
0   True  False
1  False   True
2  False  False
>>> pd.get_dummies(s1, dummy_na=True)
       a      b    NaN
0   True  False  False
1  False   True  False
2  False  False   True
>>> df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['b', 'a', 'c'],
...                    'C': [1, 2, 3]})
>>> pd.get_dummies(df, prefix=['col1', 'col2'])
   C  col1_a  col1_b  col2_a  col2_b  col2_c
0  1    True   False   False    True   False
1  2   False    True    True   False   False
2  3    True   False   False   False    True
>>> pd.get_dummies(pd.Series(list('abcaa')))
       a      b      c
0   True  False  False
1  False   True  False
2  False  False   True
3   True  False  False
4   True  False  False
>>> pd.get_dummies(pd.Series(list('abcaa')), drop_first=True)
       b      c
0  False  False
1   True  False
2  False   True
3  False  False
4  False  False
>>> pd.get_dummies(pd.Series(list('abc')), dtype=float)
     a    b    c
0  1.0  0.0  0.0
1  0.0  1.0  0.0
2  0.0  0.0  1.0
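
The columns argument restricts encoding to the listed columns and leaves the others untouched (a sketch reusing df from above):

>>> pd.get_dummies(df, columns=['A'])
   B  C    A_a    A_b
0  b  1   True  False
1  a  2  False   True
2  c  3   True  False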
pandas.from_dummies(data, sep=None, default_category=None)[source]

Create a categorical DataFrame from a DataFrame of dummy variables.

Inverts the operation performed by get_dummies().

New in version 1.5.0.

Parameters:
  • data (DataFrame) – Data which contains dummy-coded variables in form of integer columns of 1’s and 0’s.

  • sep (str, default None) – Separator used in the column names of the dummy categories; it is the character separating the categorical names from the prefixes. For example, if your column names are ‘prefix_A’ and ‘prefix_B’, you can strip the underscore by specifying sep=’_’.

  • default_category (None, Hashable or dict of Hashables, default None) – The default category is the implied category when a value has none of the listed categories specified with a one, i.e. if all dummies in a row are zero. Can be a single value for all variables or a dict directly mapping the default categories to a prefix of a variable.

Returns:

Categorical data decoded from the dummy input-data.

Return type:

DataFrame

Raises:
  • ValueError –

    • When the input DataFrame data contains NA values.

    • When the input DataFrame data contains column names with separators that do not match the separator specified with sep.

    • When a dict passed to default_category does not include an implied category for each prefix.

    • When a value in data has more than one category assigned to it.

    • When default_category=None and a value in data has no category assigned to it.

  • TypeError –

    • When the input data is not of type DataFrame.

    • When the input DataFrame data contains non-dummy data.

    • When the passed sep is of a wrong data type.

    • When the passed default_category is of a wrong data type.

See also

get_dummies()

Convert Series or DataFrame to dummy codes.

Categorical

Represent a categorical variable in classic R / S-plus fashion.

Notes

The columns of the passed dummy data should only include 1’s and 0’s, or boolean values.

Examples

>>> df = pd.DataFrame({"a": [1, 0, 0, 1], "b": [0, 1, 0, 0],
...                    "c": [0, 0, 1, 0]})
>>> df
   a  b  c
0  1  0  0
1  0  1  0
2  0  0  1
3  1  0  0
>>> pd.from_dummies(df)
0     a
1     b
2     c
3     a
>>> df = pd.DataFrame({"col1_a": [1, 0, 1], "col1_b": [0, 1, 0],
...                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
...                    "col2_c": [0, 0, 1]})
>>> df
   col1_a  col1_b  col2_a  col2_b  col2_c
0       1       0       0       1       0
1       0       1       1       0       0
2       1       0       0       0       1
>>> pd.from_dummies(df, sep="_")
  col1 col2
0    a    b
1    b    a
2    a    c
>>> df = pd.DataFrame({"col1_a": [1, 0, 0], "col1_b": [0, 1, 0],
...                    "col2_a": [0, 1, 0], "col2_b": [1, 0, 0],
...                    "col2_c": [0, 0, 0]})
>>> df
   col1_a  col1_b  col2_a  col2_b  col2_c
0       1       0       0       1       0
1       0       1       1       0       0
2       0       0       0       0       0
>>> pd.from_dummies(df, sep="_", default_category={"col1": "d", "col2": "e"})
  col1 col2
0    a    b
1    b    a
2    d    e
pandas.infer_freq(index)[source]

Infer the most likely frequency given the input index.

Parameters:

index (DatetimeIndex or TimedeltaIndex) – If passed a Series, it will use the values of the Series (NOT THE INDEX).

Returns:

None if no discernible frequency.

Return type:

str or None

Raises:
  • TypeError – If the index is not datetime-like.

  • ValueError – If there are fewer than three values.

Examples

>>> idx = pd.date_range(start='2020/12/01', end='2020/12/30', periods=30)
>>> pd.infer_freq(idx)
'D'
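
A TimedeltaIndex works the same way (illustrative sketch):

>>> tdi = pd.timedelta_range(start='0 hours', periods=4, freq='6H')
>>> pd.infer_freq(tdi)
'6H'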
pandas.interval_range(start=None, end=None, periods=None, freq=None, name=None, closed='right')[source]

Return a fixed frequency IntervalIndex.

Parameters:
  • start (numeric or datetime-like, default None) – Left bound for generating intervals.

  • end (numeric or datetime-like, default None) – Right bound for generating intervals.

  • periods (int, default None) – Number of periods to generate.

  • freq (numeric, str, datetime.timedelta, or DateOffset, default None) – The length of each interval. Must be consistent with the type of start and end, e.g. 2 for numeric, or ‘5H’ for datetime-like. Default is 1 for numeric and ‘D’ for datetime-like.

  • name (str, default None) – Name of the resulting IntervalIndex.

  • closed ({'left', 'right', 'both', 'neither'}, default 'right') – Whether the intervals are closed on the left-side, right-side, both or neither.

Return type:

IntervalIndex

See also

IntervalIndex

An Index of intervals that are all closed on the same side.

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting IntervalIndex will have periods linearly spaced elements between start and end, inclusively.

To learn more about datetime-like frequency strings, please see this link.

Examples

Numeric start and end is supported.

>>> pd.interval_range(start=0, end=5)
IntervalIndex([(0, 1], (1, 2], (2, 3], (3, 4], (4, 5]],
              dtype='interval[int64, right]')

Additionally, datetime-like input is also supported.

>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
...                   end=pd.Timestamp('2017-01-04'))
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03],
               (2017-01-03, 2017-01-04]],
              dtype='interval[datetime64[ns], right]')

The freq parameter specifies the frequency between the left and right endpoints of the individual intervals within the IntervalIndex. For numeric start and end, the frequency must also be numeric.

>>> pd.interval_range(start=0, periods=4, freq=1.5)
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              dtype='interval[float64, right]')

Similarly, for datetime-like start and end, the frequency must be convertible to a DateOffset.

>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
...                   periods=3, freq='MS')
IntervalIndex([(2017-01-01, 2017-02-01], (2017-02-01, 2017-03-01],
               (2017-03-01, 2017-04-01]],
              dtype='interval[datetime64[ns], right]')

Specify start, end, and periods; the frequency is generated automatically (linearly spaced).

>>> pd.interval_range(start=0, end=6, periods=4)
IntervalIndex([(0.0, 1.5], (1.5, 3.0], (3.0, 4.5], (4.5, 6.0]],
              dtype='interval[float64, right]')

The closed parameter specifies which endpoints of the individual intervals within the IntervalIndex are closed.

>>> pd.interval_range(end=5, periods=4, closed='both')
IntervalIndex([[1, 2], [2, 3], [3, 4], [4, 5]],
              dtype='interval[int64, both]')
pandas.isna(obj)[source]

Detect missing values for an array-like object.

This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

Parameters:

obj (scalar or array-like) – Object to check for null or missing values.

Returns:

For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is missing.

Return type:

bool or array-like of bool

See also

notna

Boolean inverse of pandas.isna.

Series.isna

Detect missing values in a Series.

DataFrame.isna

Detect missing values in a DataFrame.

Index.isna

Detect missing values in an Index.

Examples

Scalar arguments (including strings) result in a scalar boolean.

>>> pd.isna('dog')
False
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)
True

ndarrays result in an ndarray of booleans.

>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan,  3.],
       [ 4.,  5., nan]])
>>> pd.isna(array)
array([[False,  True, False],
       [False, False,  True]])

For indexes, an ndarray of booleans is returned.

>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
...                           "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
              dtype='datetime64[ns]', freq=None)
>>> pd.isna(index)
array([False, False,  True, False])

For Series and DataFrame, the same type is returned, containing booleans.

>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
     0     1    2
0  ant   bee  cat
1  dog  None  fly
>>> pd.isna(df)
       0      1      2
0  False  False  False
1  False   True  False
>>> pd.isna(df[1])
0    False
1     True
Name: 1, dtype: bool
pandas.isnull(obj)

Alias of pandas.isna().

Detect missing values for an array-like object; takes a scalar or array-like and indicates whether values are missing (NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike). The parameters, return value, and examples are identical to those of pandas.isna() above.
pandas.json_normalize(data, record_path=None, meta=None, meta_prefix=None, record_prefix=None, errors='raise', sep='.', max_level=None)[source]

Normalize semi-structured JSON data into a flat table.

Parameters:
  • data (dict or list of dicts) – Unserialized JSON objects.

  • record_path (str or list of str, default None) – Path in each object to list of records. If not passed, data will be assumed to be an array of records.

  • meta (list of paths (str or list of str), default None) – Fields to use as metadata for each record in resulting table.

  • meta_prefix (str, default None) – If not None, prefix records with dotted path, e.g. foo.bar.field if meta is [‘foo’, ‘bar’].

  • record_prefix (str, default None) – If not None, prefix records with dotted path, e.g. foo.bar.field if path to records is [‘foo’, ‘bar’].

  • errors ({'raise', 'ignore'}, default 'raise') –

    Configures error handling.

    • 'ignore' : will ignore KeyError if keys listed in meta are not always present.

    • 'raise' : will raise KeyError if keys listed in meta are not always present.

  • sep (str, default '.') – Nested records will generate names separated by sep, e.g., for sep='.', {‘foo’: {‘bar’: 0}} -> foo.bar.

  • max_level (int, default None) – Max number of levels (depth of dict) to normalize. If None, normalizes all levels.

Returns:

frame – Normalized data represented as a flat table.

Return type:

DataFrame

Examples

>>> data = [
...     {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
...     {"name": {"given": "Mark", "family": "Regner"}},
...     {"id": 2, "name": "Faye Raker"},
... ]
>>> pd.json_normalize(data)
    id name.first name.last name.given name.family        name
0  1.0     Coleen      Volk        NaN         NaN         NaN
1  NaN        NaN       NaN       Mark      Regner         NaN
2  2.0        NaN       NaN        NaN         NaN  Faye Raker
>>> data = [
...     {
...         "id": 1,
...         "name": "Cole Volk",
...         "fitness": {"height": 130, "weight": 60},
...     },
...     {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
...     {
...         "id": 2,
...         "name": "Faye Raker",
...         "fitness": {"height": 130, "weight": 60},
...     },
... ]
>>> pd.json_normalize(data, max_level=0)
    id        name                        fitness
0  1.0   Cole Volk  {'height': 130, 'weight': 60}
1  NaN    Mark Reg  {'height': 130, 'weight': 60}
2  2.0  Faye Raker  {'height': 130, 'weight': 60}

Normalizes nested data up to level 1.

>>> data = [
...     {
...         "id": 1,
...         "name": "Cole Volk",
...         "fitness": {"height": 130, "weight": 60},
...     },
...     {"name": "Mark Reg", "fitness": {"height": 130, "weight": 60}},
...     {
...         "id": 2,
...         "name": "Faye Raker",
...         "fitness": {"height": 130, "weight": 60},
...     },
... ]
>>> pd.json_normalize(data, max_level=1)
    id        name  fitness.height  fitness.weight
0  1.0   Cole Volk             130              60
1  NaN    Mark Reg             130              60
2  2.0  Faye Raker             130              60
>>> data = [
...     {
...         "state": "Florida",
...         "shortname": "FL",
...         "info": {"governor": "Rick Scott"},
...         "counties": [
...             {"name": "Dade", "population": 12345},
...             {"name": "Broward", "population": 40000},
...             {"name": "Palm Beach", "population": 60000},
...         ],
...     },
...     {
...         "state": "Ohio",
...         "shortname": "OH",
...         "info": {"governor": "John Kasich"},
...         "counties": [
...             {"name": "Summit", "population": 1234},
...             {"name": "Cuyahoga", "population": 1337},
...         ],
...     },
... ]
>>> result = pd.json_normalize(
...     data, "counties", ["state", "shortname", ["info", "governor"]]
... )
>>> result
         name  population    state shortname info.governor
0        Dade       12345   Florida    FL    Rick Scott
1     Broward       40000   Florida    FL    Rick Scott
2  Palm Beach       60000   Florida    FL    Rick Scott
3      Summit        1234   Ohio       OH    John Kasich
4    Cuyahoga        1337   Ohio       OH    John Kasich
>>> data = {"A": [1, 2]}
>>> pd.json_normalize(data, "A", record_prefix="Prefix.")
    Prefix.0
0          1
1          2

Returns normalized data with columns prefixed with the given string.
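
The sep argument controls how nested keys are joined into column names (a minimal sketch):

>>> pd.json_normalize({"a": {"b": 1}}, sep="_")
   a_b
0    1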

pandas.lreshape(data, groups, dropna=True)[source]

Reshape wide-format data to long. Generalized inverse of DataFrame.pivot.

Accepts a dictionary, groups, in which each key is a new column name and each value is a list of old column names that will be “melted” under the new column name as part of the reshape.

Parameters:
  • data (DataFrame) – The wide-format DataFrame.

  • groups (dict) – {new_name : list_of_columns}.

  • dropna (bool, default True) – Do not include columns whose entries are all NaN.

Returns:

Reshaped DataFrame.

Return type:

DataFrame

See also

melt

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

pivot

Return reshaped DataFrame organized by given index / column values.

DataFrame.pivot

Pivot without aggregation that can handle non-numeric data.

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DataFrame.unstack

Pivot based on the index values instead of a column.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Examples

>>> data = pd.DataFrame({'hr1': [514, 573], 'hr2': [545, 526],
...                      'team': ['Red Sox', 'Yankees'],
...                      'year1': [2007, 2007], 'year2': [2008, 2008]})
>>> data
   hr1  hr2     team  year1  year2
0  514  545  Red Sox   2007   2008
1  573  526  Yankees   2007   2008
>>> pd.lreshape(data, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
      team  year   hr
0  Red Sox  2007  514
1  Yankees  2007  573
2  Red Sox  2008  545
3  Yankees  2008  526
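
A sketch of the dropna flag, using a hypothetical variant of the frame above in which one hr2 value is missing: rows whose melted values contain NaN are dropped by default, and kept with dropna=False.

>>> import numpy as np
>>> data2 = data.assign(hr2=[545, np.nan])
>>> pd.lreshape(data2, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']})
      team  year     hr
0  Red Sox  2007  514.0
1  Yankees  2007  573.0
2  Red Sox  2008  545.0
>>> pd.lreshape(data2, {'year': ['year1', 'year2'], 'hr': ['hr1', 'hr2']},
...             dropna=False)
      team  year     hr
0  Red Sox  2007  514.0
1  Yankees  2007  573.0
2  Red Sox  2008  545.0
3  Yankees  2008    NaN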
pandas.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)[source]

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are “unpivoted” to the row axis, leaving just two non-identifier columns, ‘variable’ and ‘value’.

Parameters:
  • id_vars (tuple, list, or ndarray, optional) – Column(s) to use as identifier variables.

  • value_vars (tuple, list, or ndarray, optional) – Column(s) to unpivot. If not specified, uses all columns that are not set as id_vars.

  • var_name (scalar) – Name to use for the ‘variable’ column. If None it uses frame.columns.name or ‘variable’.

  • value_name (scalar, default 'value') – Name to use for the ‘value’ column.

  • col_level (int or str, optional) – If columns are a MultiIndex then use this level to melt.

  • ignore_index (bool, default True) –

    If True, original index is ignored. If False, the original index is retained. Index labels will be repeated as necessary.

    New in version 1.1.0.

  • frame (DataFrame) –

Returns:

Unpivoted DataFrame.

Return type:

DataFrame

See also

DataFrame.melt

Identical method.

pivot_table

Create a spreadsheet-style pivot table as a DataFrame.

DataFrame.pivot

Return reshaped DataFrame organized by given index / column values.

DataFrame.explode

Explode a DataFrame from list-like columns to long format.

Notes

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({'A': {0: 'a', 1: 'b', 2: 'c'},
...                    'B': {0: 1, 1: 3, 2: 5},
...                    'C': {0: 2, 1: 4, 2: 6}})
>>> df
   A  B  C
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
3  a        C      2
4  b        C      4
5  c        C      6

The names of ‘variable’ and ‘value’ columns can be customized:

>>> pd.melt(df, id_vars=['A'], value_vars=['B'],
...         var_name='myVarname', value_name='myValname')
   A myVarname  myValname
0  a         B          1
1  b         B          3
2  c         B          5

Original index values can be kept around:

>>> pd.melt(df, id_vars=['A'], value_vars=['B', 'C'], ignore_index=False)
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
0  a        C      2
1  b        C      4
2  c        C      6

If you have multi-index columns:

>>> df.columns = [list('ABC'), list('DEF')]
>>> df
   A  B  C
   D  E  F
0  a  1  2
1  b  3  4
2  c  5  6
>>> pd.melt(df, col_level=0, id_vars=['A'], value_vars=['B'])
   A variable  value
0  a        B      1
1  b        B      3
2  c        B      5
>>> pd.melt(df, id_vars=[('A', 'D')], value_vars=[('B', 'E')])
  (A, D) variable_0 variable_1  value
0      a          B          E      1
1      b          B          E      3
2      c          B          E      5
pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=None, indicator=False, validate=None)[source]

Merge DataFrame or named Series objects with a database-style join.

A named Series object is treated as a DataFrame with a single named column.

The join is done on columns or indexes. If joining columns on columns, the DataFrame indexes will be ignored. Otherwise if joining indexes on indexes or indexes on a column or columns, the index will be passed on. When performing a cross merge, no column specifications to merge on are allowed.

Warning

If both key columns contain rows where the key is a null value, those rows will be matched against each other. This is different from usual SQL join behaviour and can lead to unexpected results.

Parameters:
  • left (DataFrame or named Series) –

  • right (DataFrame or named Series) – Object to merge with.

  • how ({'left', 'right', 'outer', 'inner', 'cross'}, default 'inner') –

    Type of merge to be performed.

    • left: use only keys from left frame, similar to a SQL left outer join; preserve key order.

    • right: use only keys from right frame, similar to a SQL right outer join; preserve key order.

    • outer: use union of keys from both frames, similar to a SQL full outer join; sort keys lexicographically.

    • inner: use intersection of keys from both frames, similar to a SQL inner join; preserve the order of the left keys.

    • cross: creates the cartesian product from both frames, preserves the order of the left keys.

      New in version 1.2.0.

  • on (label or list) – Column or index level names to join on. These must be found in both DataFrames. If on is None and not merging on indexes then this defaults to the intersection of the columns in both DataFrames.

  • left_on (label or list, or array-like) – Column or index level names to join on in the left DataFrame. Can also be an array or list of arrays of the length of the left DataFrame. These arrays are treated as if they are columns.

  • right_on (label or list, or array-like) – Column or index level names to join on in the right DataFrame. Can also be an array or list of arrays of the length of the right DataFrame. These arrays are treated as if they are columns.

  • left_index (bool, default False) – Use the index from the left DataFrame as the join key(s). If it is a MultiIndex, the number of keys in the other DataFrame (either the index or a number of columns) must match the number of levels.

  • right_index (bool, default False) – Use the index from the right DataFrame as the join key. Same caveats as left_index.

  • sort (bool, default False) – Sort the join keys lexicographically in the result DataFrame. If False, the order of the join keys depends on the join type (how keyword).

  • suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

  • copy (bool, default True) – If False, avoid copy if possible.

  • indicator (bool or str, default False) – If True, adds a column to the output DataFrame called “_merge” with information on the source of each row. The column can be given a different name by providing a string argument. The column will have a Categorical type with the value of “left_only” for observations whose merge key only appears in the left DataFrame, “right_only” for observations whose merge key only appears in the right DataFrame, and “both” if the observation’s merge key is found in both DataFrames.

  • validate (str, optional) –

    If specified, checks if merge is of specified type.

    • ”one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.

    • ”one_to_many” or “1:m”: check if merge keys are unique in left dataset.

    • ”many_to_one” or “m:1”: check if merge keys are unique in right dataset.

    • ”many_to_many” or “m:m”: allowed, but does not result in checks.

Returns:

A DataFrame of the two merged objects.

Return type:

DataFrame

See also

merge_ordered

Merge with optional filling/interpolation.

merge_asof

Merge on nearest keys.

DataFrame.join

Similar method using indices.

Notes

Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0. Support for merging named Series objects was added in version 0.24.0.

Examples

>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1
    lkey value
0   foo      1
1   bar      2
2   baz      3
3   foo      5
>>> df2
    rkey value
0   foo      5
1   bar      6
2   baz      7
3   foo      8

Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y, appended.

>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7

Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey',
...           suffixes=('_left', '_right'))
  lkey  value_left rkey  value_right
0  foo           1  foo            5
1  foo           1  foo            8
2  foo           5  foo            5
3  foo           5  foo            8
4  bar           2  bar            6
5  baz           3  baz            7

Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.

>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
    Index(['value'], dtype='object')
>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'baz'], 'c': [3, 4]})
>>> df1
      a  b
0   foo  1
1   bar  2
>>> df2
      a  c
0   foo  3
1   baz  4
>>> df1.merge(df2, how='inner', on='a')
      a  b  c
0   foo  1  3
>>> df1.merge(df2, how='left', on='a')
      a  b  c
0   foo  1  3.0
1   bar  2  NaN
>>> df1 = pd.DataFrame({'left': ['foo', 'bar']})
>>> df2 = pd.DataFrame({'right': [7, 8]})
>>> df1
    left
0   foo
1   bar
>>> df2
    right
0   7
1   8
>>> df1.merge(df2, how='cross')
   left  right
0   foo      7
1   foo      8
2   bar      7
3   bar      8
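
The indicator and validate parameters can be sketched with small hypothetical frames: indicator=True tags the origin of each row, and validate raises a MergeError when the stated key cardinality does not hold (the exact exception text may vary between versions).

>>> df1 = pd.DataFrame({'a': ['foo', 'bar'], 'b': [1, 2]})
>>> df2 = pd.DataFrame({'a': ['foo', 'foo'], 'c': [3, 4]})
>>> df1.merge(df2, how='outer', on='a', indicator=True)
     a  b    c     _merge
0  bar  2  NaN  left_only
1  foo  1  3.0       both
2  foo  1  4.0       both
>>> df1.merge(df2, on='a', validate='one_to_one')
Traceback (most recent call last):
...
pandas.errors.MergeError: Merge keys are not unique in right dataset; not a one-to-one merge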
pandas.merge_asof(left, right, on=None, left_on=None, right_on=None, left_index=False, right_index=False, by=None, left_by=None, right_by=None, suffixes=('_x', '_y'), tolerance=None, allow_exact_matches=True, direction='backward')[source]

Perform a merge by key distance.

This is similar to a left-join except that we match on nearest key rather than equal keys. Both DataFrames must be sorted by the key.

For each row in the left DataFrame:

  • A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.

  • A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.

  • A “nearest” search selects the row in the right DataFrame whose ‘on’ key is closest in absolute distance to the left’s key.

The default is “backward”, which matches the behaviour of versions below 0.20.0. The direction parameter was added in version 0.20.0 and introduces “forward” and “nearest”.

Optionally match on equivalent keys with ‘by’ before searching with ‘on’.

Parameters:
  • left (DataFrame or named Series) –

  • right (DataFrame or named Series) –

  • on (label) – Field name to join on. Must be found in both DataFrames. The data MUST be ordered. Furthermore this must be a numeric column, such as datetimelike, integer, or float. On or left_on/right_on must be given.

  • left_on (label) – Field name to join on in left DataFrame.

  • right_on (label) – Field name to join on in right DataFrame.

  • left_index (bool) – Use the index of the left DataFrame as the join key.

  • right_index (bool) – Use the index of the right DataFrame as the join key.

  • by (column name or list of column names) – Match on these columns before performing merge operation.

  • left_by (column name) – Field names to match on in the left DataFrame.

  • right_by (column name) – Field names to match on in the right DataFrame.

  • suffixes (2-length sequence (tuple, list, ...)) – Suffix to apply to overlapping column names in the left and right side, respectively.

  • tolerance (int or Timedelta, optional, default None) – Select asof tolerance within this range; must be compatible with the merge index.

  • allow_exact_matches (bool, default True) –

    • If True, allow matching with the same ‘on’ value (i.e. less-than-or-equal-to / greater-than-or-equal-to)

    • If False, don’t match the same ‘on’ value (i.e., strictly less-than / strictly greater-than).

  • direction ('backward' (default), 'forward', or 'nearest') – Whether to search for prior, subsequent, or closest matches.

Return type:

DataFrame

See also

merge

Merge with a database-style join.

merge_ordered

Merge with optional filling/interpolation.

Examples

>>> left = pd.DataFrame({"a": [1, 5, 10], "left_val": ["a", "b", "c"]})
>>> left
    a left_val
0   1        a
1   5        b
2  10        c
>>> right = pd.DataFrame({"a": [1, 2, 3, 6, 7], "right_val": [1, 2, 3, 6, 7]})
>>> right
   a  right_val
0  1          1
1  2          2
2  3          3
3  6          6
4  7          7
>>> pd.merge_asof(left, right, on="a")
    a left_val  right_val
0   1        a          1
1   5        b          3
2  10        c          7
>>> pd.merge_asof(left, right, on="a", allow_exact_matches=False)
    a left_val  right_val
0   1        a        NaN
1   5        b        3.0
2  10        c        7.0
>>> pd.merge_asof(left, right, on="a", direction="forward")
    a left_val  right_val
0   1        a        1.0
1   5        b        6.0
2  10        c        NaN
>>> pd.merge_asof(left, right, on="a", direction="nearest")
    a left_val  right_val
0   1        a          1
1   5        b          6
2  10        c          7
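
A tolerance can also be a plain integer for integer keys (a hypothetical variation on the frames above); matches further than the tolerance away are dropped:

>>> pd.merge_asof(left, right, on="a", tolerance=2)
    a left_val  right_val
0   1        a        1.0
1   5        b        3.0
2  10        c        NaN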

We can use indexed DataFrames as well.

>>> left = pd.DataFrame({"left_val": ["a", "b", "c"]}, index=[1, 5, 10])
>>> left
   left_val
1         a
5         b
10        c
>>> right = pd.DataFrame({"right_val": [1, 2, 3, 6, 7]}, index=[1, 2, 3, 6, 7])
>>> right
   right_val
1          1
2          2
3          3
6          6
7          7
>>> pd.merge_asof(left, right, left_index=True, right_index=True)
   left_val  right_val
1         a          1
5         b          3
10        c          7

Here is a real-world time-series example.

>>> quotes = pd.DataFrame(
...     {
...         "time": [
...             pd.Timestamp("2016-05-25 13:30:00.023"),
...             pd.Timestamp("2016-05-25 13:30:00.023"),
...             pd.Timestamp("2016-05-25 13:30:00.030"),
...             pd.Timestamp("2016-05-25 13:30:00.041"),
...             pd.Timestamp("2016-05-25 13:30:00.048"),
...             pd.Timestamp("2016-05-25 13:30:00.049"),
...             pd.Timestamp("2016-05-25 13:30:00.072"),
...             pd.Timestamp("2016-05-25 13:30:00.075")
...         ],
...         "ticker": [
...                "GOOG",
...                "MSFT",
...                "MSFT",
...                "MSFT",
...                "GOOG",
...                "AAPL",
...                "GOOG",
...                "MSFT"
...            ],
...            "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
...            "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03]
...     }
... )
>>> quotes
                     time ticker     bid     ask
0 2016-05-25 13:30:00.023   GOOG  720.50  720.93
1 2016-05-25 13:30:00.023   MSFT   51.95   51.96
2 2016-05-25 13:30:00.030   MSFT   51.97   51.98
3 2016-05-25 13:30:00.041   MSFT   51.99   52.00
4 2016-05-25 13:30:00.048   GOOG  720.50  720.93
5 2016-05-25 13:30:00.049   AAPL   97.99   98.01
6 2016-05-25 13:30:00.072   GOOG  720.50  720.88
7 2016-05-25 13:30:00.075   MSFT   52.01   52.03
>>> trades = pd.DataFrame(
...        {
...            "time": [
...                pd.Timestamp("2016-05-25 13:30:00.023"),
...                pd.Timestamp("2016-05-25 13:30:00.038"),
...                pd.Timestamp("2016-05-25 13:30:00.048"),
...                pd.Timestamp("2016-05-25 13:30:00.048"),
...                pd.Timestamp("2016-05-25 13:30:00.048")
...            ],
...            "ticker": ["MSFT", "MSFT", "GOOG", "GOOG", "AAPL"],
...            "price": [51.95, 51.95, 720.77, 720.92, 98.0],
...            "quantity": [75, 155, 100, 100, 100]
...        }
...    )
>>> trades
                     time ticker   price  quantity
0 2016-05-25 13:30:00.023   MSFT   51.95        75
1 2016-05-25 13:30:00.038   MSFT   51.95       155
2 2016-05-25 13:30:00.048   GOOG  720.77       100
3 2016-05-25 13:30:00.048   GOOG  720.92       100
4 2016-05-25 13:30:00.048   AAPL   98.00       100

By default we are taking the asof of the quotes

>>> pd.merge_asof(trades, quotes, on="time", by="ticker")
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

We only match asof within 2ms between the quote time and the trade time.

>>> pd.merge_asof(
...     trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms")
... )
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75   51.95   51.96
1 2016-05-25 13:30:00.038   MSFT   51.95       155     NaN     NaN
2 2016-05-25 13:30:00.048   GOOG  720.77       100  720.50  720.93
3 2016-05-25 13:30:00.048   GOOG  720.92       100  720.50  720.93
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN

We only match asof within 10ms between the quote time and the trade time, and we exclude exact matches on time. However, prior data will propagate forward.

>>> pd.merge_asof(
...     trades,
...     quotes,
...     on="time",
...     by="ticker",
...     tolerance=pd.Timedelta("10ms"),
...     allow_exact_matches=False
... )
                     time ticker   price  quantity     bid     ask
0 2016-05-25 13:30:00.023   MSFT   51.95        75     NaN     NaN
1 2016-05-25 13:30:00.038   MSFT   51.95       155   51.97   51.98
2 2016-05-25 13:30:00.048   GOOG  720.77       100     NaN     NaN
3 2016-05-25 13:30:00.048   GOOG  720.92       100     NaN     NaN
4 2016-05-25 13:30:00.048   AAPL   98.00       100     NaN     NaN
pandas.merge_ordered(left, right, on=None, left_on=None, right_on=None, left_by=None, right_by=None, fill_method=None, suffixes=('_x', '_y'), how='outer')[source]

Perform a merge for ordered data with optional filling/interpolation.

Designed for ordered data like time series data. Optionally perform group-wise merge (see examples).

Parameters:
  • left (DataFrame or named Series) –

  • right (DataFrame or named Series) –

  • on (label or list) – Field names to join on. Must be found in both DataFrames.

  • left_on (label or list, or array-like) – Field names to join on in left DataFrame. Can be a vector or list of vectors of the length of the DataFrame to use a particular vector as the join key instead of columns.

  • right_on (label or list, or array-like) – Field names to join on in right DataFrame or vector/list of vectors per left_on docs.

  • left_by (column name or list of column names) – Group left DataFrame by group columns and merge piece by piece with right DataFrame. Must be None if either left or right are a Series.

  • right_by (column name or list of column names) – Group right DataFrame by group columns and merge piece by piece with left DataFrame. Must be None if either left or right are a Series.

  • fill_method ({'ffill', None}, default None) – Interpolation method for data.

  • suffixes (list-like, default is ("_x", "_y")) – A length-2 sequence where each element is optionally a string indicating the suffix to add to overlapping column names in left and right respectively. Pass a value of None instead of a string to indicate that the column name from left or right should be left as-is, with no suffix. At least one of the values must not be None.

  • how ({'left', 'right', 'outer', 'inner'}, default 'outer') –

    • left: use only keys from left frame (SQL: left outer join)

    • right: use only keys from right frame (SQL: right outer join)

    • outer: use union of keys from both frames (SQL: full outer join)

    • inner: use intersection of keys from both frames (SQL: inner join).

Returns:

The merged DataFrame output type will be the same as ‘left’, if it is a subclass of DataFrame.

Return type:

DataFrame

See also

merge

Merge with a database-style join.

merge_asof

Merge on nearest keys.

Examples

>>> from pandas import merge_ordered
>>> df1 = pd.DataFrame(
...     {
...         "key": ["a", "c", "e", "a", "c", "e"],
...         "lvalue": [1, 2, 3, 1, 2, 3],
...         "group": ["a", "a", "a", "b", "b", "b"]
...     }
... )
>>> df1
  key  lvalue group
0   a       1     a
1   c       2     a
2   e       3     a
3   a       1     b
4   c       2     b
5   e       3     b
>>> df2 = pd.DataFrame({"key": ["b", "c", "d"], "rvalue": [1, 2, 3]})
>>> df2
  key  rvalue
0   b       1
1   c       2
2   d       3
>>> merge_ordered(df1, df2, fill_method="ffill", left_by="group")
  key  lvalue group  rvalue
0   a       1     a     NaN
1   b       1     a     1.0
2   c       2     a     2.0
3   d       2     a     3.0
4   e       3     a     3.0
5   a       1     b     NaN
6   b       1     b     1.0
7   c       2     b     2.0
8   d       2     b     3.0
9   e       3     b     3.0
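
For contrast, a sketch of the same group-wise merge without fill_method: unmatched keys simply remain NaN, and lvalue becomes float once it contains missing values.

>>> merge_ordered(df1, df2, left_by="group")
  key  lvalue group  rvalue
0   a     1.0     a     NaN
1   b     NaN     a     1.0
2   c     2.0     a     2.0
3   d     NaN     a     3.0
4   e     3.0     a     NaN
5   a     1.0     b     NaN
6   b     NaN     b     1.0
7   c     2.0     b     2.0
8   d     NaN     b     3.0
9   e     3.0     b     NaN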
pandas.notna(obj)[source]

Detect non-missing values for an array-like object.

This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).

Parameters:

obj (array-like or object value) – Object to check for not null or non-missing values.

Returns:

For scalar input, returns a scalar boolean. For array input, returns an array of boolean indicating whether each corresponding element is valid.

Return type:

bool or array-like of bool

See also

isna

Boolean inverse of pandas.notna.

Series.notna

Detect valid values in a Series.

DataFrame.notna

Detect valid values in a DataFrame.

Index.notna

Detect valid values in an Index.

Examples

Scalar arguments (including strings) result in a scalar boolean.

>>> pd.notna('dog')
True
>>> pd.notna(pd.NA)
False
>>> pd.notna(np.nan)
False

ndarrays result in an ndarray of booleans.

>>> array = np.array([[1, np.nan, 3], [4, 5, np.nan]])
>>> array
array([[ 1., nan,  3.],
       [ 4.,  5., nan]])
>>> pd.notna(array)
array([[ True, False,  True],
       [ True,  True, False]])

For indexes, an ndarray of booleans is returned.

>>> index = pd.DatetimeIndex(["2017-07-05", "2017-07-06", None,
...                          "2017-07-08"])
>>> index
DatetimeIndex(['2017-07-05', '2017-07-06', 'NaT', '2017-07-08'],
              dtype='datetime64[ns]', freq=None)
>>> pd.notna(index)
array([ True,  True, False,  True])

For Series and DataFrame, the same type is returned, containing booleans.

>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
     0     1    2
0  ant   bee  cat
1  dog  None  fly
>>> pd.notna(df)
      0      1     2
0  True   True  True
1  True  False  True
>>> pd.notna(df[1])
0     True
1    False
Name: 1, dtype: bool
pandas.notnull(obj)

Detect non-missing values for an array-like object. pandas.notnull is an alias of pandas.notna: its parameters, return type, and behaviour are identical to those documented for pandas.notna directly above, so they are not repeated here.
class pandas.option_context[source]

Context manager to temporarily set options in the with statement context.

You need to invoke as option_context(pat, val, [(pat, val), ...]).

Examples

>>> from pandas import option_context
>>> with option_context('display.max_rows', 10, 'display.max_columns', 5):
...     pass
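
A sketch showing the scope of the override (assuming display.max_rows still has its documented default of 60 outside the block):

>>> with option_context('display.max_rows', 10):
...     print(pd.get_option('display.max_rows'))
10
>>> pd.get_option('display.max_rows')
60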
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None)[source]

Return a fixed frequency PeriodIndex.

The day (calendar) is the default frequency.

Parameters:
  • start (str or period-like, default None) – Left bound for generating periods.

  • end (str or period-like, default None) – Right bound for generating periods.

  • periods (int, default None) – Number of periods to generate.

  • freq (str or DateOffset, optional) – Frequency alias. By default the freq is taken from start or end if those are Period objects. Otherwise, the default is "D" for daily frequency.

  • name (str, default None) – Name of the resulting PeriodIndex.

Return type:

PeriodIndex

Notes

Of the three parameters: start, end, and periods, exactly two must be specified.

To learn more about the frequency strings, please see this link.

Examples

>>> pd.period_range(start='2017-01-01', end='2018-01-01', freq='M')
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04', '2017-05', '2017-06',
             '2017-07', '2017-08', '2017-09', '2017-10', '2017-11', '2017-12',
             '2018-01'],
            dtype='period[M]')

If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with frequency matching that of the period_range constructor.

>>> pd.period_range(start=pd.Period('2017Q1', freq='Q'),
...                 end=pd.Period('2017Q2', freq='Q'), freq='M')
PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'],
            dtype='period[M]')
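
Because exactly two of start, end, and periods must be specified, a count can stand in for one of the bounds. A minimal sketch:

>>> pd.period_range(start='2017-01-01', periods=3, freq='M')
PeriodIndex(['2017-01', '2017-02', '2017-03'], dtype='period[M]')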
pandas.pivot(data, *, columns, index=_NoDefault.no_default, values=_NoDefault.no_default)[source]

Return reshaped DataFrame organized by given index / column values.

Reshape data (produce a “pivot” table) based on column values. Uses unique values from specified index / columns to form axes of the resulting DataFrame. This function does not support data aggregation, multiple values will result in a MultiIndex in the columns. See the User Guide for more on reshaping.

Parameters:
  • data (DataFrame) –

  • columns (str or object or a list of str) –

    Column to use to make new frame’s columns.

    Changed in version 1.1.0: Also accept list of columns names.

  • index (str or object or a list of str, optional) –

    Column to use to make new frame’s index. If not given, uses existing index.

    Changed in version 1.1.0: Also accept list of index names.

  • values (str, object or a list of the previous, optional) – Column(s) to use for populating new frame’s values. If not specified, all remaining columns will be used and the result will have hierarchically indexed columns.

Returns:

Returns reshaped DataFrame.

Return type:

DataFrame

Raises:

ValueError: – Raised when there are any index, columns combinations with multiple values. Use DataFrame.pivot_table when you need to aggregate.

See also

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DataFrame.unstack

Pivot based on the index values instead of a column.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Notes

For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two',
...                            'two'],
...                    'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
...                    'baz': [1, 2, 3, 4, 5, 6],
...                    'zoo': ['x', 'y', 'z', 'q', 'w', 't']})
>>> df
    foo   bar  baz  zoo
0   one   A    1    x
1   one   B    2    y
2   one   C    3    z
3   two   A    4    q
4   two   B    5    w
5   two   C    6    t
>>> df.pivot(index='foo', columns='bar', values='baz')
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar')['baz']
bar  A   B   C
foo
one  1   2   3
two  4   5   6
>>> df.pivot(index='foo', columns='bar', values=['baz', 'zoo'])
      baz       zoo
bar   A  B  C   A  B  C
foo
one   1  2  3   x  y  z
two   4  5  6   q  w  t

You could also assign a list of column names or a list of index names.

>>> df = pd.DataFrame({
...        "lev1": [1, 1, 1, 2, 2, 2],
...        "lev2": [1, 1, 2, 1, 1, 2],
...        "lev3": [1, 2, 1, 2, 1, 2],
...        "lev4": [1, 2, 3, 4, 5, 6],
...        "values": [0, 1, 2, 3, 4, 5]})
>>> df
    lev1 lev2 lev3 lev4 values
0   1    1    1    1    0
1   1    1    2    2    1
2   1    2    1    3    2
3   2    1    2    4    3
4   2    1    1    5    4
5   2    2    2    6    5
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0
>>> df.pivot(index=["lev1", "lev2"], columns=["lev3"], values="values")
      lev3    1    2
lev1  lev2
   1     1  0.0  1.0
         2  2.0  NaN
   2     1  4.0  3.0
         2  NaN  5.0

A ValueError is raised if there are any duplicates.

>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})
>>> df
   foo bar  baz
0  one   A    1
1  one   A    2
2  two   B    3
3  two   C    4

Notice that the first two rows are the same for our index and columns arguments.

>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
   ...
ValueError: Index contains duplicate entries, cannot reshape
pandas.pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, dropna=True, margins_name='All', observed=False, sort=True)[source]

Create a spreadsheet-style pivot table as a DataFrame.

The levels in the pivot table will be stored in MultiIndex objects (hierarchical indexes) on the index and columns of the result DataFrame.

Parameters:
  • data (DataFrame) –

  • values (list-like or scalar, optional) – Column or columns to aggregate.

  • index (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table index. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).

  • columns (column, Grouper, array, or list of the previous) – Keys to group by on the pivot table column. If an array is passed, it must be the same length as the data and is used in the same manner as column values. The list can contain any of the other types (except list).

  • aggfunc (function, list of functions, dict, default 'mean') – If a list of functions is passed, the resulting pivot table will have hierarchical columns whose top level are the function names (inferred from the function objects themselves). If a dict is passed, the key is the column to aggregate and the value is a function or list of functions. If margins=True, aggfunc will be used to calculate the partial aggregates.

  • fill_value (scalar, default None) – Value to replace missing values with (in the resulting pivot table, after aggregation).

  • margins (bool, default False) – If margins=True, special All columns and rows will be added with partial group aggregates across the categories on the rows and columns.

  • dropna (bool, default True) – Do not include columns whose entries are all NaN. If True, rows with a NaN value in any column will be omitted before computing margins.

  • margins_name (str, default 'All') – Name of the row / column that will contain the totals when margins is True.

  • observed (bool, default False) – This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.

  • sort (bool, default True) –

    Specifies if the result should be sorted.

    New in version 1.3.0.

Returns:

An Excel style pivot table.

Return type:

DataFrame

See also

DataFrame.pivot

Pivot without aggregation that can handle non-numeric data.

DataFrame.melt

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

wide_to_long

Wide panel to long format. Less flexible but more user-friendly than melt.

Notes

Reference the user guide for more examples.

Examples

>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7],
...                    "E": [2, 4, 5, 5, 6, 6, 8, 9, 9]})
>>> df
     A    B      C  D  E
0  foo  one  small  1  2
1  foo  one  large  2  4
2  foo  one  large  2  5
3  foo  two  small  3  5
4  foo  two  small  3  6
5  bar  one  large  4  6
6  bar  one  small  5  8
7  bar  two  small  6  9
8  bar  two  large  7  9

This first example aggregates values by taking the sum.

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum)
>>> table
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0

We can also fill missing values using the fill_value parameter.

>>> table = pd.pivot_table(df, values='D', index=['A', 'B'],
...                        columns=['C'], aggfunc=np.sum, fill_value=0)
>>> table
C        large  small
A   B
bar one      4      5
    two      7      6
foo one      4      1
    two      0      6

The next example aggregates by taking the mean across multiple columns.

>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean, 'E': np.mean})
>>> table
                D         E
A   C
bar large  5.500000  7.500000
    small  5.500000  8.500000
foo large  2.000000  4.500000
    small  2.333333  4.333333

We can also calculate multiple types of aggregations for any given value column.

>>> table = pd.pivot_table(df, values=['D', 'E'], index=['A', 'C'],
...                        aggfunc={'D': np.mean,
...                                 'E': [min, max, np.mean]})
>>> table
                  D   E
               mean max      mean  min
A   C
bar large  5.500000   9  7.500000    6
    small  5.500000   9  8.500000    8
foo large  2.000000   5  4.500000    4
    small  2.333333   6  4.333333    2
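
The margins option can be sketched with the same df: margins=True appends partial aggregates in a row and column named after margins_name (here summing D; exact result dtypes may vary between versions).

>>> table = pd.pivot_table(df, values='D', index=['A'], columns=['C'],
...                        aggfunc='sum', margins=True)
>>> table
C    large  small  All
A
bar     11     11   22
foo      4      7   11
All     15     18   33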
pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')[source]

Quantile-based discretization function.

Discretize variable into equal-sized buckets based on rank or based on sample quantiles. For example 1000 values for 10 quantiles would produce a Categorical object indicating quantile membership for each data point.

Parameters:
  • x (1d ndarray or Series) –

  • q (int or list-like of float) – Number of quantiles. 10 for deciles, 4 for quartiles, etc. Alternately array of quantiles, e.g. [0, .25, .5, .75, 1.] for quartiles.

  • labels (array or False, default None) – Used as labels for the resulting bins. Must be of the same length as the resulting bins. If False, return only integer indicators of the bins. If True, raises an error.

  • retbins (bool, optional) – Whether to return the (bins, labels) or not. Can be useful if bins is given as a scalar.

  • precision (int, optional) – The precision at which to store and display the bins labels.

  • duplicates ({default 'raise', 'drop'}, optional) – If bin edges are not unique, raise ValueError or drop non-uniques.

Returns:

  • out (Categorical or Series or array of integers if labels is False) – The return type (Categorical or Series) depends on the input: a Series of type category if input is a Series else Categorical. Bins are represented as categories when categorical data is returned.

  • bins (ndarray of floats) – Returned only if retbins is True.

Notes

Out of bounds values will be NA in the resulting Categorical object.

Examples

>>> pd.qcut(range(5), 4)
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64, right]): [(-0.001, 1.0] < (1.0, 2.0] ...
>>> pd.qcut(range(5), 3, labels=["good", "medium", "bad"])
[good, good, medium, bad, bad]
Categories (3, object): [good < medium < bad]
>>> pd.qcut(range(5), 4, labels=False)
array([0, 0, 1, 2, 3])
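
When the computed quantile edges are not unique, the default duplicates='raise' errors, while duplicates='drop' collapses the duplicate edges into fewer bins (a sketch; the exact error text may vary between versions).

>>> pd.qcut([0, 0, 0, 0, 1, 2], 2)
Traceback (most recent call last):
...
ValueError: Bin edges must be unique: array([0., 0., 2.]).
You can drop duplicate edges by setting the 'duplicates' kwarg
>>> pd.qcut([0, 0, 0, 0, 1, 2], 2, labels=False, duplicates='drop')
array([0, 0, 0, 0, 0, 0])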
pandas.read_clipboard(sep='\\s+', dtype_backend=_NoDefault.no_default, **kwargs)[source]

Read text from clipboard and pass to read_csv.

Parameters:
  • sep (str, default '\s+') – A string or regex delimiter. The default of '\s+' denotes one or more whitespace characters.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation, and when “pyarrow” is set, pyarrow-backed dtypes are used for all dtypes. Otherwise the DataFrame is backed by plain NumPy arrays.

    The dtype_backends are still experimental.

    New in version 2.0.

  • **kwargs – See read_csv for the full argument list.

Returns:

A parsed DataFrame object.

Return type:

DataFrame
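
Examples

Clipboard contents depend on what was last copied, so the following is an illustrative sketch rather than a reproducible doctest. Assuming a small whitespace-delimited table (a header line "a b c" followed by a row "1 2 3") has been copied:

>>> pd.read_clipboard()
   a  b  c
0  1  2  3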

pandas.read_csv(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=None, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)[source]

Read a comma-separated values (csv) file into DataFrame.

Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for IO Tools.

Parameters:
  • filepath_or_buffer (str, path object or file-like object) –

    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

  • sep (str, default ',') – Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

  • delimiter (str, default None) – Alias for sep.

  • header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

  • names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

  • index_col (int, str, sequence of int / str, or False, optional, default None) –

    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

  • usecols (list-like or callable, optional) –

    Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

    If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.

  • dtype (Type name or dict of column -> type, optional) –

    Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

    New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.

  • engine ({'c', 'python', 'pyarrow'}, optional) –

    Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.

    New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.

  • converters (dict, optional) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

  • true_values (list, optional) – Values to consider as True in addition to case-insensitive variants of “True”.

  • false_values (list, optional) – Values to consider as False in addition to case-insensitive variants of “False”.

  • skipinitialspace (bool, default False) – Skip spaces after delimiter.

  • skiprows (list-like, int or callable, optional) –

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

  • skipfooter (int, default 0) – Number of lines at bottom of file to skip (Unsupported with engine=’c’).

  • nrows (int, optional) – Number of rows of file to read. Useful for reading pieces of large files.

  • na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.

  • keep_default_na (bool, default True) –

    Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

    • If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

    • If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

    • If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.

    • If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

    Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.

  • na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

  • verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.

  • skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.

  • parse_dates (bool or list of int or names or list of lists or dict, default False) –

    The behavior is as follows:

    • boolean. If True -> try parsing the index.

    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

    If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

    Note: A fast-path exists for iso8601-formatted dates.

  • infer_datetime_format (bool, default False) –

    If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

    Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

  • keep_date_col (bool, default False) – If True and parse_dates specifies combining multiple columns then keep the original columns.

  • date_parser (function, optional) –

    Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

    Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as-needed.

  • date_format (str or dict of column -> format, default None) –

    If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as-needed.

    New in version 2.0.0.

  • dayfirst (bool, default False) – DD/MM format dates, international and European format.

  • cache_dates (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.

  • iterator (bool, default False) –

    Return TextFileReader object for iteration or getting chunks with get_chunk().

    Changed in version 1.2: TextFileReader is a context manager.

  • chunksize (int, optional) –

    Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

    Changed in version 1.2: TextFileReader is a context manager.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • thousands (str, optional) – Thousands separator.

  • decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).

  • lineterminator (str (length 1), optional) – Character to break file into lines. Only valid with C parser.

  • quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

  • quoting (int or csv.QUOTE_* instance, default 0) – Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

  • doublequote (bool, default True) – When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.

  • escapechar (str (length 1), optional) – One-character string used to escape other characters.

  • comment (str, optional) – Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.

  • encoding (str, optional, default "utf-8") –

    Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings .

    Changed in version 1.2: When encoding is None, errors="replace" is passed to open(). Otherwise, errors="strict" is passed to open(). This behavior was previously only the case for engine="python".

    Changed in version 1.3.0: encoding_errors is a new argument. encoding has no longer an influence on how encoding errors are handled.

  • encoding_errors (str, optional, default "strict") –

    How encoding errors are treated. List of possible values .

    New in version 1.3.0.

  • dialect (str or csv.Dialect, optional) – If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for more details.

  • on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') –

    Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:

    • ’error’, raise an Exception when a bad line is encountered.

    • ’warn’, raise a warning when a bad line is encountered and skip that line.

    • ’skip’, skip bad lines without raising or warning when they are encountered.

    New in version 1.3.0.

    New in version 1.4.0:

    • callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python"

  • delim_whitespace (bool, default False) – Specifies whether or not whitespace (e.g. ' ' or '    ') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

  • low_memory (bool, default True) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser).

  • memory_map (bool, default False) – If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

  • float_precision (str, optional) –

    Specifies which converter the C engine should use for floating-point values. The options are None or ‘high’ for the ordinary converter, ‘legacy’ for the original lower precision pandas converter, and ‘round_trip’ for the round-trip converter.

    Changed in version 1.2.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.

Return type:

DataFrame or TextFileReader

See also

DataFrame.to_csv

Write DataFrame to a comma-separated values (csv) file.

read_csv

Read a comma-separated values (csv) file into DataFrame.

read_fwf

Read a table of fixed-width formatted lines into DataFrame.

Examples

>>> pd.read_csv('data.csv')  
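
The on_bad_lines parameter also accepts a callable when engine="python". A hedged sketch (the inline data and the lambda are illustrative, not part of the API reference); the callable trims each bad line back to the expected two fields:

>>> from io import StringIO
>>> data = "a,b\n1,2\n3,4,5\n6,7"
>>> pd.read_csv(StringIO(data), engine="python",
...             on_bad_lines=lambda line: line[:2])
   a  b
0  1  2
1  3  4
2  6  7
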
pandas.read_excel(io, sheet_name=0, *, header=0, names=None, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skiprows=None, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, parse_dates=False, date_parser=_NoDefault.no_default, date_format=None, thousands=None, decimal='.', comment=None, skipfooter=0, storage_options=None, dtype_backend=_NoDefault.no_default)[source]

Read an Excel file into a pandas DataFrame.

Supports xls, xlsx, xlsm, xlsb, odf, ods and odt file extensions read from a local filesystem or URL. Supports an option to read a single sheet or a list of sheets.

Parameters:
  • io (str, bytes, ExcelFile, xlrd.Book, path object, or file-like object) –

    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.xlsx.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

  • sheet_name (str, int, list, or None, default 0) –

    Strings are used for sheet names. Integers are used in zero-indexed sheet positions (chart sheets do not count as a sheet position). Lists of strings/integers are used to request multiple sheets. Specify None to get all worksheets.

    Available cases:

    • Defaults to 0: 1st sheet as a DataFrame

    • 1: 2nd sheet as a DataFrame

    • "Sheet1": Load sheet with name “Sheet1”

    • [0, 1, "Sheet5"]: Load first, second and sheet named “Sheet5” as a dict of DataFrame

    • None: All worksheets.

  • header (int, list of int, default 0) – Row (0-indexed) to use for the column labels of the parsed DataFrame. If a list of integers is passed those row positions will be combined into a MultiIndex. Use None if there is no header.

  • names (array-like, default None) – List of column names to use. If file contains no header row, then you should explicitly pass header=None.

  • index_col (int, list of int, default None) –

    Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

    Missing values will be forward filled to allow roundtripping with to_excel for merged_cells=True. To avoid forward filling the missing values use set_index after reading the data instead of index_col.

  • usecols (str, list-like, or callable, default None) –

    • If None, then parse all columns.

    • If str, then indicates comma separated list of Excel column letters and column ranges (e.g. “A:E” or “A,C,E:F”). Ranges are inclusive of both sides.

    • If list of int, then indicates list of column numbers to be parsed (0-indexed).

    • If list of string, then indicates list of column names to be parsed.

    • If callable, then evaluate each column name against it and parse the column if the callable returns True.

    Returns a subset of the columns according to behavior above.

  • dtype (Type name or dict of column -> type, default None) – Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use object to preserve data as stored in Excel and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

  • engine (str, default None) –

    If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility:

    • “xlrd” supports old-style Excel files (.xls).

    • “openpyxl” supports newer Excel file formats.

    • “odf” supports OpenDocument file formats (.odf, .ods, .odt).

    • “pyxlsb” supports Binary Excel files.

    Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:

    • If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.

    • Otherwise if path_or_buffer is an xls format, xlrd will be used.

    • Otherwise if path_or_buffer is in xlsb format, pyxlsb will be used.

      New in version 1.3.0.

    • Otherwise openpyxl will be used.

      Changed in version 1.3.0.

  • converters (dict, default None) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the Excel cell content, and return the transformed content.

  • true_values (list, default None) – Values to consider as True.

  • false_values (list, default None) – Values to consider as False.

  • skiprows (list-like, int, or callable, optional) – Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file. If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

  • nrows (int, default None) – Number of rows to parse.

  • na_values (scalar, str, list-like, or dict, default None) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.

  • keep_default_na (bool, default True) –

    Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

    • If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

    • If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

    • If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.

    • If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

    Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.

  • na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

  • verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.

  • parse_dates (bool, list-like, or dict, default False) –

    The behavior is as follows:

    • bool. If True -> try parsing the index.

    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

    If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an object data type. If you don’t want to parse some cells as dates, just change their type in Excel to “Text”. For non-standard datetime parsing, use pd.to_datetime after pd.read_excel.

    Note: A fast-path exists for iso8601-formatted dates.

  • date_parser (function, optional) –

    Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

    Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as-needed.

  • date_format (str or dict of column -> format, default None) –

    If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as-needed.

    New in version 2.0.0.

  • thousands (str, default None) – Thousands separator for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel, any numeric columns will automatically be parsed, regardless of display format.

  • decimal (str, default '.') –

    Character to recognize as decimal point for parsing string columns to numeric. Note that this parameter is only necessary for columns stored as TEXT in Excel; any numeric columns will automatically be parsed, regardless of display format (e.g. use ‘,’ for European data).

    New in version 1.4.0.

  • comment (str, default None) – Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored.

  • skipfooter (int, default 0) – Rows at the end to skip (0-indexed).

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

DataFrame from the passed in Excel file. See notes in sheet_name argument for more information on when a dict of DataFrames is returned.

Return type:

DataFrame or dict of DataFrames

See also

DataFrame.to_excel

Write DataFrame to an Excel file.

DataFrame.to_csv

Write DataFrame to a comma-separated values (csv) file.

read_csv

Read a comma-separated values (csv) file into DataFrame.

read_fwf

Read a table of fixed-width formatted lines into DataFrame.

Examples

The file can be read using the file name as string or an open file object:

>>> pd.read_excel('tmp.xlsx', index_col=0)  
       Name  Value
0   string1      1
1   string2      2
2  #Comment      3
>>> pd.read_excel(open('tmp.xlsx', 'rb'),
...               sheet_name='Sheet3')  
   Unnamed: 0      Name  Value
0           0   string1      1
1           1   string2      2
2           2  #Comment      3

Index and header can be specified via the index_col and header arguments

>>> pd.read_excel('tmp.xlsx', index_col=None, header=None)  
     0         1      2
0  NaN      Name  Value
1  0.0   string1      1
2  1.0   string2      2
3  2.0  #Comment      3

Column types are inferred but can be explicitly specified

>>> pd.read_excel('tmp.xlsx', index_col=0,
...               dtype={'Name': str, 'Value': float})  
       Name  Value
0   string1    1.0
1   string2    2.0
2  #Comment    3.0

True, False, and NA values, and thousands separators have defaults, but can be explicitly specified, too. Supply the values you would like as strings or lists of strings!

>>> pd.read_excel('tmp.xlsx', index_col=0,
...               na_values=['string1', 'string2'])  
       Name  Value
0       NaN      1
1       NaN      2
2  #Comment      3

Comment lines in the Excel input file can be skipped using the comment kwarg

>>> pd.read_excel('tmp.xlsx', index_col=0, comment='#')  
      Name  Value
0  string1    1.0
1  string2    2.0
2     None    NaN
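
Multiple sheets can be requested in a single call, returning a dict of DataFrames keyed by sheet name. A minimal sketch against the same hypothetical tmp.xlsx (the sheet names shown are illustrative):

>>> sheets = pd.read_excel('tmp.xlsx', sheet_name=None)  
>>> list(sheets)  
['Sheet1', 'Sheet3']
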
pandas.read_feather(path, columns=None, use_threads=True, storage_options=None, dtype_backend=_NoDefault.no_default)[source]

Load a feather-format object from the file path.

Parameters:
  • path (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.feather.

  • columns (sequence, default None) – If not provided, all columns are read.

  • use_threads (bool, default True) – Whether to parallelize reading using multiple threads.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Return type:

type of object stored in file
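
Examples

A minimal round-trip sketch (assumes the optional pyarrow dependency is installed; the path is illustrative):

>>> df = pd.DataFrame({'a': [1, 2]})  
>>> df.to_feather('./out.feather')  
>>> pd.read_feather('./out.feather')  
   a
0  1
1  2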

pandas.read_fwf(filepath_or_buffer, *, colspecs='infer', widths=None, infer_nrows=100, dtype_backend=_NoDefault.no_default, **kwds)[source]

Read a table of fixed-width formatted lines into DataFrame.

Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for IO Tools.

Parameters:
  • filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a text read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

  • colspecs (list of tuple (int, int) or 'infer', optional) – A list of tuples giving the extents of the fixed-width fields of each line as half-open intervals (i.e., [from, to)). String value ‘infer’ can be used to instruct the parser to try detecting the column specifications from the first 100 rows of the data which are not being skipped via skiprows (default=’infer’).

  • widths (list of int, optional) – A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.

  • infer_nrows (int, default 100) – The number of rows to consider when letting the parser determine the colspecs.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

  • **kwds (optional) – Optional keyword arguments can be passed to TextFileReader.

Returns:

A comma-separated values (csv) file is returned as a two-dimensional data structure with labeled axes.

Return type:

DataFrame or TextFileReader

See also

DataFrame.to_csv

Write DataFrame to a comma-separated values (csv) file.

read_csv

Read a comma-separated values (csv) file into DataFrame.

Examples

>>> pd.read_fwf('data.csv')  
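
A hedged sketch with explicit colspecs on in-memory data (the data and the column extents are illustrative):

>>> from io import StringIO
>>> data = "id8141  360.242\nid1594  444.953"
>>> pd.read_fwf(StringIO(data), colspecs=[(0, 6), (8, 15)], header=None)
        0        1
0  id8141  360.242
1  id1594  444.953
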
pandas.read_gbq(query, project_id=None, index_col=None, col_order=None, reauth=False, auth_local_webserver=True, dialect=None, location=None, configuration=None, credentials=None, use_bqstorage_api=None, max_results=None, progress_bar_type=None)[source]

Load data from Google BigQuery.

This function requires the pandas-gbq package.

See the How to authenticate with Google BigQuery guide for authentication instructions.

Parameters:
  • query (str) – SQL-Like Query to return data values.

  • project_id (str, optional) – Google BigQuery Account project ID. Optional when available from the environment.

  • index_col (str, optional) – Name of result column to use for index in results DataFrame.

  • col_order (list(str), optional) – List of BigQuery column names in the desired order for results DataFrame.

  • reauth (bool, default False) – Force Google BigQuery to re-authenticate the user. This is useful if multiple accounts are used.

  • auth_local_webserver (bool, default True) –

    Use the local webserver flow instead of the console flow when getting user credentials.

    New in version 0.2.0 of pandas-gbq.

    Changed in version 1.5.0: Default value is changed to True. Google has deprecated the auth_local_webserver = False “out of band” (copy-paste) flow.

  • dialect (str, default 'legacy') –

    Note: The default value is changing to ‘standard’ in a future version.

    SQL syntax dialect to use. Value can be one of:

    'legacy'

    Use BigQuery’s legacy SQL dialect. For more information see BigQuery Legacy SQL Reference.

    'standard'

    Use BigQuery’s standard SQL, which is compliant with the SQL 2011 standard. For more information see BigQuery Standard SQL Reference.

  • location (str, optional) –

    Location where the query job should run. See the BigQuery locations documentation for a list of available locations. The location must match that of any datasets used in the query.

    New in version 0.5.0 of pandas-gbq.

  • configuration (dict, optional) –

    Query config parameters for job processing. For example:

    configuration = {‘query’: {‘useQueryCache’: False}}

    For more information see BigQuery REST API Reference.

  • credentials (google.auth.credentials.Credentials, optional) –

    Credentials for accessing Google APIs. Use this parameter to override default credentials, such as to use Compute Engine google.auth.compute_engine.Credentials or Service Account google.oauth2.service_account.Credentials directly.

    New in version 0.8.0 of pandas-gbq.

  • use_bqstorage_api (bool, default False) –

    Use the BigQuery Storage API to download query results quickly, but at an increased cost. To use this API, first enable it in the Cloud Console. You must also have the bigquery.readsessions.create permission on the project you are billing queries to.

    This feature requires version 0.10.0 or later of the pandas-gbq package. It also requires the google-cloud-bigquery-storage and fastavro packages.

  • max_results (int, optional) –

    If set, limit the maximum number of rows to fetch from the query results.

    New in version 0.12.0 of pandas-gbq.

    New in version 1.1.0.

  • progress_bar_type (Optional, str) –

    If set, use the tqdm library to display a progress bar while the data downloads. Install the tqdm package to use this feature.

    Possible values of progress_bar_type include:

    None

    No progress bar.

    'tqdm'

    Use the tqdm.tqdm() function to print a progress bar to sys.stderr.

    'tqdm_notebook'

    Use the tqdm.tqdm_notebook() function to display a progress bar as a Jupyter notebook widget.

    'tqdm_gui'

    Use the tqdm.tqdm_gui() function to display a progress bar as a graphical dialog box.

    Note that this feature requires version 0.12.0 or later of the pandas-gbq package, as well as the tqdm package. Slightly different from pandas-gbq, the default here is None.

Returns:

df – DataFrame representing results of query.

Return type:

DataFrame

See also

pandas_gbq.read_gbq

This function in the pandas-gbq library.

DataFrame.to_gbq

Write a DataFrame to Google BigQuery.
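
Examples

A minimal sketch (requires the pandas-gbq package and valid Google Cloud credentials; the query and project ID are illustrative):

>>> df = pd.read_gbq('SELECT 1 AS x', project_id='my-project', dialect='standard')  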

pandas.read_hdf(path_or_buf, key=None, mode='r', errors='strict', where=None, start=None, stop=None, columns=None, iterator=False, chunksize=None, **kwargs)[source]

Read from the store, close it if we opened it.

Retrieve pandas object stored in file, optionally based on where criteria.

Warning

Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can be unsafe.

See: https://docs.python.org/3/library/pickle.html for more.

Parameters:
  • path_or_buf (str, path object, pandas.HDFStore) –

    Any valid string path is acceptable. Only supports the local file system, remote URLs and file-like objects are not supported.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    Alternatively, pandas accepts an open pandas.HDFStore object.

  • key (object, optional) – The group identifier in the store. Can be omitted if the HDF file contains a single pandas object.

  • mode ({'r', 'r+', 'a'}, default 'r') – Mode to use when opening the file. Ignored if path_or_buf is a pandas.HDFStore. Default is ‘r’.

  • errors (str, default 'strict') – Specifies how encoding and decoding errors are to be handled. See the errors argument for open() for a full list of options.

  • where (list, optional) – A list of Term (or convertible) objects.

  • start (int, optional) – Row number to start selection.

  • stop (int, optional) – Row number to stop selection.

  • columns (list, optional) – A list of column names to return.

  • iterator (bool, optional) – Return an iterator object.

  • chunksize (int, optional) – Number of rows to include in an iteration when using an iterator.

  • **kwargs – Additional keyword arguments passed to HDFStore.

Returns:

The selected object. Return type depends on the object stored.

Return type:

object

See also

DataFrame.to_hdf

Write a HDF file from a DataFrame.

HDFStore

Low-level access to HDF files.

Examples

>>> df = pd.DataFrame([[1, 1.0, 'a']], columns=['x', 'y', 'z'])  
>>> df.to_hdf('./store.h5', 'data')  
>>> reread = pd.read_hdf('./store.h5')  
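
Selection with where criteria is sketched below under the assumption that PyTables is installed; where queries require data written with format='table' (and data_columns for column queries):

>>> df.to_hdf('./store.h5', key='data', format='table', data_columns=True)  
>>> pd.read_hdf('./store.h5', 'data', where='x == 1')  
   x    y  z
0  1  1.0  a
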
pandas.read_html(io, *, match='.+', flavor=None, header=None, index_col=None, skiprows=None, attrs=None, parse_dates=False, thousands=',', encoding=None, decimal='.', converters=None, na_values=None, keep_default_na=True, displayed_only=True, extract_links=None, dtype_backend=_NoDefault.no_default)[source]

Read HTML tables into a list of DataFrame objects.

Parameters:
  • io (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a string read() function. The string can represent a URL or the HTML itself. Note that lxml only accepts the http, ftp and file url protocols. If you have a URL that starts with 'https' you might try removing the 's'.

  • match (str or compiled regular expression, optional) – The set of tables containing text matching this regex or string will be returned. Unless the HTML is extremely simple you will probably need to pass a non-empty string here. Defaults to ‘.+’ (match any non-empty string). The default value will return all tables contained on a page. This value is converted to a regular expression so that there is consistent behavior between Beautiful Soup and lxml.

  • flavor (str, optional) – The parsing engine to use. ‘bs4’ and ‘html5lib’ are synonymous with each other, they are both there for backwards compatibility. The default of None tries to use lxml to parse and if that fails it falls back on bs4 + html5lib.

  • header (int or list-like, optional) – The row (or list of rows for a MultiIndex) to use to make the columns headers.

  • index_col (int or list-like, optional) – The column (or list of columns) to use to create the index.

  • skiprows (int, list-like or slice, optional) – Number of rows to skip after parsing the column integer. 0-based. If a sequence of integers or a slice is given, will skip the rows indexed by that sequence. Note that a single element sequence means ‘skip the nth row’ whereas an integer means ‘skip n rows’.

  • attrs (dict, optional) –

    This is a dictionary of attributes that you can pass to use to identify the table in the HTML. These are not checked for validity before being passed to lxml or Beautiful Soup. However, these attributes must be valid HTML table attributes to work correctly. For example,

    attrs = {'id': 'table'}
    

    is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML attribute for any HTML tag as per this document.

    attrs = {'asdf': 'table'}
    

    is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even if it is a valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft of the HTML 5 spec can be found here. It contains the latest information on table attributes for the modern web.

  • parse_dates (bool, optional) – See read_csv() for more details.

  • thousands (str, optional) – Separator to use to parse thousands. Defaults to ','.

  • encoding (str, optional) – The encoding used to decode the web page. Defaults to None. None preserves the previous encoding behavior, which depends on the underlying parser library (e.g., the parser library will try to use the encoding provided by the document).

  • decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).

  • converters (dict, default None) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels, values are functions that take one input argument, the cell (not column) content, and return the transformed content.

  • na_values (iterable, default None) – Custom NA values.

  • keep_default_na (bool, default True) – If na_values are specified and keep_default_na is False the default NaN values are overridden, otherwise they’re appended to.

  • displayed_only (bool, default True) – Whether elements with “display: none” should be parsed.

  • extract_links ({None, "all", "header", "body", "footer"}) –

    Table elements in the specified section(s) with <a> tags will have their href extracted.

    New in version 1.5.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

A list of DataFrames.

Return type:

dfs

See also

read_csv

Read a comma-separated values (csv) file into DataFrame.

Notes

Before using this function you should read the gotchas about the HTML parsing libraries.

Expect to do some cleanup after you call this function. For example, you might need to manually assign column names if the column names are converted to NaN when you pass the header=0 argument. We try to assume as little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table to the user.

This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts to properly handle colspan and rowspan attributes. If the table has a <thead> element, it is used to construct the header; otherwise the function attempts to find the header within the body (by putting rows with only <th> elements into the header).

Similar to read_csv() the header argument is applied after skiprows is applied.

This function will always return a list of DataFrame or it will fail, e.g., it will not return an empty list.

Examples

See the read_html documentation in the IO section of the docs for some examples of reading in HTML tables.
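
As a minimal sketch, an in-memory table can be parsed directly (requires one of the optional parser libraries, e.g. lxml, or bs4 with html5lib; the markup is illustrative):

>>> from io import StringIO
>>> html = "<table><tr><th>a</th></tr><tr><td>1</td></tr></table>"
>>> pd.read_html(StringIO(html))[0]
   a
0  1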

pandas.read_json(path_or_buf, *, orient=None, typ='frame', dtype=None, convert_axes=None, convert_dates=True, keep_default_dates=True, precise_float=False, date_unit=None, encoding=None, encoding_errors='strict', lines=False, chunksize=None, compression='infer', nrows=None, storage_options=None, dtype_backend=_NoDefault.no_default, engine='ujson')[source]

Convert a JSON string to pandas object.

Parameters:
  • path_or_buf (a valid JSON str, path object or file-like object) –

    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.json.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

  • orient (str, optional) –

    Indication of expected JSON string format. Compatible JSON strings can be produced by to_json() with a corresponding orient value. The set of possible orients is:

    • 'split' : dict like {index -> [index], columns -> [columns], data -> [values]}

    • 'records' : list like [{column -> value}, ... , {column -> value}]

    • 'index' : dict like {index -> {column -> value}}

    • 'columns' : dict like {column -> {index -> value}}

    • 'values' : just the values array

    The allowed and default values depend on the value of the typ parameter.

    • when typ == 'series',

      • allowed orients are {'split','records','index'}

      • default is 'index'

      • The Series index must be unique for orient 'index'.

    • when typ == 'frame',

      • allowed orients are {'split','records','index', 'columns','values', 'table'}

      • default is 'columns'

      • The DataFrame index must be unique for orients 'index' and 'columns'.

      • The DataFrame columns must be unique for orients 'index', 'columns', and 'records'.

  • typ ({'frame', 'series'}, default 'frame') – The type of object to recover.

  • dtype (bool or dict, default None) –

    If True, infer dtypes; if a dict of column to dtype, then use those; if False, then don’t infer dtypes at all, applies only to the data.

    For all orient values except 'table', default is True.

  • convert_axes (bool, default None) –

    Try to convert the axes to the proper dtypes.

    For all orient values except 'table', default is True.

  • convert_dates (bool or list of str, default True) – If True then default datelike columns may be converted (depending on keep_default_dates). If False, no dates will be converted. If a list of column names, then those columns will be converted and default datelike columns may also be converted (depending on keep_default_dates).

  • keep_default_dates (bool, default True) –

    If parsing dates (convert_dates is not False), then try to parse the default datelike columns. A column label is datelike if

    • it ends with '_at',

    • it ends with '_time',

    • it begins with 'timestamp',

    • it is 'modified', or

    • it is 'date'.

  • precise_float (bool, default False) – Set to enable usage of higher precision (strtod) function when decoding string to double values. Default (False) is to use fast but less precise builtin functionality.

  • date_unit (str, default None) – The timestamp unit to detect if converting dates. The default behaviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds respectively.

  • encoding (str, default is 'utf-8') – The encoding to use to decode py3 bytes.

  • encoding_errors (str, optional, default "strict") –

    How encoding errors are treated. See the list of possible values.

    New in version 1.3.0.

  • lines (bool, default False) – Read the file as a json object per line.

  • chunksize (int, optional) –

    Return JsonReader object for iteration. See the line-delimited json docs for more information on chunksize. This can only be passed if lines=True. If this is None, the file will be read into memory all at once.

    Changed in version 1.2: JsonReader is a context manager.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘path_or_buf’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • nrows (int, optional) –

    The number of lines from the line-delimited json file that have to be read. This can only be passed if lines=True. If this is None, all the rows will be returned.

    New in version 1.1.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

  • engine ({"ujson", "pyarrow"}, default "ujson") –

    Parser engine to use. The "pyarrow" engine is only available when lines=True.

    New in version 2.0.

Returns:

The type returned depends on the value of typ.

Return type:

Series or DataFrame

See also

DataFrame.to_json

Convert a DataFrame to a JSON string.

Series.to_json

Convert a Series to a JSON string.

json_normalize

Normalize semi-structured JSON data into a flat table.

Notes

Specific to orient='table', if a DataFrame with a literal Index name of index gets written with to_json(), the subsequent read operation will incorrectly set the Index name to None. This is because index is also used by DataFrame.to_json() to denote a missing Index name, and the subsequent read_json() operation cannot distinguish between the two. The same limitation is encountered with a MultiIndex and any names beginning with 'level_'.

Examples

>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])

Encoding/decoding a Dataframe using 'split' formatted JSON:

>>> df.to_json(orient='split')
    '{"columns":["col 1","col 2"],"index":["row 1","row 2"],"data":[["a","b"],["c","d"]]}'
>>> pd.read_json(_, orient='split')
      col 1 col 2
row 1     a     b
row 2     c     d

Encoding/decoding a Dataframe using 'index' formatted JSON:

>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index')
      col 1 col 2
row 1     a     b
row 2     c     d

Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not preserved with this encoding.

>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> pd.read_json(_, orient='records')
  col 1 col 2
0     a     b
1     c     d

Encoding with Table Schema

>>> df.to_json(orient='table')
    '{"schema":{"fields":[{"name":"index","type":"string"},{"name":"col 1","type":"string"},{"name":"col 2","type":"string"}],"primaryKey":["index"],"pandas_version":"1.4.0"},"data":[{"index":"row 1","col 1":"a","col 2":"b"},{"index":"row 2","col 1":"c","col 2":"d"}]}'
pandas.read_orc(path, columns=None, dtype_backend=_NoDefault.no_default, **kwargs)[source]

Load an ORC object from the file path, returning a DataFrame.

Parameters:
  • path (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.orc.

  • columns (list, default None) – If not None, only these columns will be read from the file. Output always follows the ordering of the file and not the columns list. This mirrors the original behaviour of pyarrow.orc.ORCFile.read().

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

  • **kwargs – Any additional kwargs are passed to pyarrow.

Return type:

DataFrame

Notes

Before using this function you should read the user guide about ORC and install optional dependencies.
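
Examples

A minimal sketch (requires the optional pyarrow dependency; the path is illustrative):

>>> df = pd.read_orc('example.orc')  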

pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=_NoDefault.no_default, dtype_backend=_NoDefault.no_default, **kwargs)[source]

Load a parquet object from the file path, returning a DataFrame.

Parameters:
  • path (str, path object or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.

  • engine ({'auto', 'pyarrow', 'fastparquet'}, default 'auto') – Parquet library to use. If ‘auto’, then the option io.parquet.engine is used. The default io.parquet.engine behavior is to try ‘pyarrow’, falling back to ‘fastparquet’ if ‘pyarrow’ is unavailable.

  • columns (list, default=None) – If not None, only these columns will be read from the file.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.3.0.

  • use_nullable_dtypes (bool, default False) –

    If True, use dtypes that use pd.NA as missing value indicator for the resulting DataFrame (only applicable for the pyarrow engine). As new dtypes are added that support pd.NA in the future, the output with this option will change to use those dtypes. Note: this is an experimental option, and behaviour (e.g. additional supported dtypes) may change without notice.

    Deprecated since version 2.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

  • **kwargs – Any additional kwargs are passed to the engine.

Return type:

DataFrame
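
Examples

A minimal round-trip sketch (requires pyarrow or fastparquet; the path is illustrative):

>>> df = pd.DataFrame({'a': [1, 2]})  
>>> df.to_parquet('./out.parquet')  
>>> pd.read_parquet('./out.parquet', columns=['a'])  
   a
0  1
1  2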

pandas.read_pickle(filepath_or_buffer, compression='infer', storage_options=None)[source]

Load pickled pandas object (or any object) from file.

Warning

Loading pickled data received from untrusted sources can be unsafe. See here.

Parameters:
  • filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary readlines() function. Also accepts URL. URL is not limited to S3 and GCS.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

Return type:

same type as object stored in file

See also

DataFrame.to_pickle

Pickle (serialize) DataFrame object to file.

Series.to_pickle

Pickle (serialize) Series object to file.

read_hdf

Read HDF5 file into a DataFrame.

read_sql

Read SQL query or database table into a DataFrame.

read_parquet

Load a parquet object, returning a DataFrame.

Notes

read_pickle is only guaranteed to be backwards compatible to pandas 0.20.3 provided the object was serialized with to_pickle.

Examples

>>> original_df = pd.DataFrame(
...     {"foo": range(5), "bar": range(5, 10)}
...    )  
>>> original_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> pd.to_pickle(original_df, "./dummy.pkl")  
>>> unpickled_df = pd.read_pickle("./dummy.pkl")  
>>> unpickled_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
pandas.read_sas(filepath_or_buffer, *, format=None, index=None, encoding=None, chunksize=None, iterator=False, compression='infer')[source]

Read SAS files stored as either XPORT or SAS7BDAT format files.

Parameters:
  • filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.sas7bdat.

  • format (str {'xport', 'sas7bdat'} or None) – If None, file format is inferred from file extension. If ‘xport’ or ‘sas7bdat’, uses the corresponding format.

  • index (identifier of index column, defaults to None) – Identifier of column that should be used as index of the DataFrame.

  • encoding (str, default is None) – Encoding for text data. If None, text data are stored as raw bytes.

  • chunksize (int) –

    Read file chunksize lines at a time, returns iterator.

    Changed in version 1.2: TextFileReader is a context manager.

  • iterator (bool, defaults to False) –

    If True, returns an iterator for reading the file incrementally.

    Changed in version 1.2: TextFileReader is a context manager.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

Returns:

  • DataFrame if iterator=False and chunksize=None, else SAS7BDATReader

  • or XportReader

Return type:

DataFrame | ReaderBase
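
Examples

A minimal sketch (the file name is illustrative):

>>> df = pd.read_sas('data.sas7bdat')  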

pandas.read_spss(path, usecols=None, convert_categoricals=True, dtype_backend=_NoDefault.no_default)[source]

Load an SPSS file from the file path, returning a DataFrame.

Parameters:
  • path (str or Path) – File path.

  • usecols (list-like, optional) – Return a subset of the columns. If None, return all columns.

  • convert_categoricals (bool, default is True) – Convert categorical columns into pd.Categorical.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Return type:

DataFrame
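
Examples

A minimal sketch (requires the optional pyreadstat dependency; the file name and column are illustrative):

>>> df = pd.read_spss('survey.sav', usecols=['age'])  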

pandas.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default, dtype=None)[source]

Read SQL query or database table into a DataFrame.

This function is a convenience wrapper around read_sql_table and read_sql_query (for backward compatibility). It will delegate to the specific function depending on the provided input. A SQL query will be routed to read_sql_query, while a database table name will be routed to read_sql_table. Note that the delegated function might have more specific notes about their functionality not listed here.

Parameters:
  • sql (str or SQLAlchemy Selectable (select or text object)) – SQL query to be executed or a table name.

  • con (SQLAlchemy connectable, str, or sqlite3 connection) –

    Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported. The user is responsible for engine disposal and connection closure for the SQLAlchemy connectable; str connections are closed automatically. See here.

  • index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).

  • coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.

  • params (list, tuple or dict, optional, default: None) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported. E.g. psycopg2 uses %(name)s, so use params={‘name’: ‘value’}.

  • parse_dates (list or dict, default: None) –

    • List of column names to parse as dates.

    • Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.

    • Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime() Especially useful with databases without native Datetime support, such as SQLite.

  • columns (list, default: None) – List of column names to select from SQL table (only used when reading a table).

  • chunksize (int, default None) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

  • dtype (Type name or dict of columns) –

    Data type for data or columns. E.g. np.float64 or {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. The argument is ignored if a table is passed instead of a query.

    New in version 2.0.0.

Return type:

DataFrame or Iterator[DataFrame]

See also

read_sql_table

Read SQL database table into a DataFrame.

read_sql_query

Read SQL query into a DataFrame.

Examples

Read data from SQL via either a SQL query or a SQL tablename. When using a SQLite database only SQL queries are accepted, providing only the SQL tablename will result in an error.

>>> from sqlite3 import connect
>>> conn = connect(':memory:')
>>> df = pd.DataFrame(data=[[0, '10/11/12'], [1, '12/11/10']],
...                   columns=['int_column', 'date_column'])
>>> df.to_sql('test_data', conn)
2
>>> pd.read_sql('SELECT int_column, date_column FROM test_data', conn)
   int_column date_column
0           0    10/11/12
1           1    12/11/10
>>> pd.read_sql('test_data', 'postgres:///db_name')  

Apply date parsing to columns through the parse_dates argument. The parse_dates argument calls pd.to_datetime on the provided columns. Custom argument values for applying pd.to_datetime on a column are specified via a dictionary format:

>>> pd.read_sql('SELECT int_column, date_column FROM test_data',
...             conn,
...             parse_dates={"date_column": {"format": "%d/%m/%y"}})
   int_column date_column
0           0  2012-11-10
1           1  2010-11-12
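
Chunked reads return an iterator of DataFrames; a hedged sketch continuing the connection above (each chunk holds at most chunksize rows):

>>> for chunk in pd.read_sql('SELECT * FROM test_data', conn, chunksize=1):
...     print(len(chunk))
1
1
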
pandas.read_sql_query(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, chunksize=None, dtype=None, dtype_backend=_NoDefault.no_default)[source]

Read SQL query into a DataFrame.

Returns a DataFrame corresponding to the result set of the query string. Optionally provide an index_col parameter to use one of the columns as the index, otherwise default integer index will be used.

Parameters:
  • sql (str SQL query or SQLAlchemy Selectable (select or text object)) – SQL query to be executed.

  • con (SQLAlchemy connectable, str, or sqlite3 connection) – Using SQLAlchemy makes it possible to use any DB supported by that library. If a DBAPI2 object, only sqlite3 is supported.

  • index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).

  • coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Useful for SQL result sets.

  • params (list, tuple or dict, optional, default: None) – List of parameters to pass to execute method. The syntax used to pass parameters is database driver dependent. Check your database driver documentation for which of the five syntax styles, described in PEP 249’s paramstyle, is supported. E.g. psycopg2 uses %(name)s, so use params={‘name’: ‘value’}.

  • parse_dates (list or dict, default: None) –

    • List of column names to parse as dates.

    • Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times, or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.

    • Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime() Especially useful with databases without native Datetime support, such as SQLite.

  • chunksize (int, default None) – If specified, return an iterator where chunksize is the number of rows to include in each chunk.

  • dtype (Type name or dict of columns) –

    Data type for data or columns. E.g. np.float64 or {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}.

    New in version 1.3.0.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Return type:

DataFrame or Iterator[DataFrame]

See also

read_sql_table

Read SQL database table into a DataFrame.

read_sql

Read SQL query or database table into a DataFrame.

Notes

Any datetime values with time zone information parsed via the parse_dates parameter will be converted to UTC.
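
Examples

A minimal sketch of a parameterized query against an in-memory SQLite database (qmark paramstyle; the table and values are illustrative):

>>> from sqlite3 import connect
>>> conn = connect(':memory:')
>>> _ = conn.execute('CREATE TABLE t (x int)')
>>> _ = conn.execute('INSERT INTO t VALUES (1), (2)')
>>> pd.read_sql_query('SELECT x FROM t WHERE x = ?', conn, params=(1,))
   x
0  1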

pandas.read_sql_table(table_name, con, schema=None, index_col=None, coerce_float=True, parse_dates=None, columns=None, chunksize=None, dtype_backend=_NoDefault.no_default)[source]

Read SQL database table into a DataFrame.

Given a table name and a SQLAlchemy connectable, returns a DataFrame. This function does not support DBAPI connections.

Parameters:
  • table_name (str) – Name of SQL table in database.

  • con (SQLAlchemy connectable or str) – A database URI could be provided as str. SQLite DBAPI connection mode not supported.

  • schema (str, default None) – Name of SQL schema in database to query (if database flavor supports this). Uses default schema if None (default).

  • index_col (str or list of str, optional, default: None) – Column(s) to set as index(MultiIndex).

  • coerce_float (bool, default True) – Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point. Can result in loss of precision.

  • parse_dates (list or dict, default None) –

    • List of column names to parse as dates.

    • Dict of {column_name: format string} where format string is strftime compatible in case of parsing string times or is one of (D, s, ns, ms, us) in case of parsing integer timestamps.

    • Dict of {column_name: arg dict}, where the arg dict corresponds to the keyword arguments of pandas.to_datetime() Especially useful with databases without native Datetime support, such as SQLite.

  • columns (list, default None) – List of column names to select from SQL table.

  • chunksize (int, default None) – If specified, returns an iterator where chunksize is the number of rows to include in each chunk.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow dtypes are used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

A SQL table is returned as a two-dimensional data structure with labeled axes.

Return type:

DataFrame or Iterator[DataFrame]

See also

read_sql_query

Read SQL query into a DataFrame.

read_sql

Read SQL query or database table into a DataFrame.

Notes

Any datetime values with time zone information will be converted to UTC.

Examples

>>> pd.read_sql_table('table_name', 'postgres:///db_name')  
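
Chunked reads return an iterator of DataFrames rather than a single frame; as a sketch, where the connection string and table name are placeholders and process() stands in for your per-chunk logic:

>>> for chunk in pd.read_sql_table('table_name', 'postgres:///db_name',
...                                chunksize=1000):
...     process(chunk)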
pandas.read_stata(filepath_or_buffer, *, convert_dates=True, convert_categoricals=True, index_col=None, convert_missing=False, preserve_dtypes=True, columns=None, order_categoricals=True, chunksize=None, iterator=False, compression='infer', storage_options=None)[source]

Read Stata file into DataFrame.

Parameters:
  • filepath_or_buffer (str, path object or file-like object) –

    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.dta.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

  • convert_dates (bool, default True) – Convert date variables to DataFrame time values.

  • convert_categoricals (bool, default True) – Read value labels and convert columns to Categorical/Factor variables.

  • index_col (str, optional) – Column to set as index.

  • convert_missing (bool, default False) – Flag indicating whether to convert missing values to their Stata representations. If False, missing values are replaced with nan. If True, columns containing missing values are returned with object data types and missing values are represented by StataMissingValue objects.

  • preserve_dtypes (bool, default True) – Preserve Stata datatypes. If False, numeric data are upcast to pandas default types for foreign data (float64 or int64).

  • columns (list or None) – Columns to retain. Columns will be returned in the given order. None returns all columns.

  • order_categoricals (bool, default True) – Flag indicating whether converted categorical data are ordered.

  • chunksize (int, default None) – Return StataReader object for iterations, returns chunks with given number of lines.

  • iterator (bool, default False) – Return StataReader object.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

Return type:

DataFrame or StataReader

See also

io.stata.StataReader

Low-level reader for Stata data files.

DataFrame.to_stata

Export Stata data files.

Notes

Categorical variables read through an iterator may not have the same categories and dtype. This occurs when a variable stored in a DTA file is associated with an incomplete set of value labels that only label a strict subset of the values.

Examples

Creating a dummy Stata file for this example:

>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon', 'parrot'],
...                     'speed': [350, 18, 361, 15]})  
>>> df.to_stata('animals.dta')  

Read a Stata dta file:

>>> df = pd.read_stata('animals.dta')  

Read a Stata dta file in 10,000 line chunks:

>>> values = np.random.randint(0, 10, size=(20_000, 1), dtype="uint8")  
>>> df = pd.DataFrame(values, columns=["i"])  
>>> df.to_stata('filename.dta')  
>>> with pd.read_stata('filename.dta', chunksize=10000) as itr:
...     for chunk in itr:
...         # Operate on a single chunk, e.g., chunk.mean()
...         pass
pandas.read_table(filepath_or_buffer, *, sep=_NoDefault.no_default, delimiter=None, header='infer', names=_NoDefault.no_default, index_col=None, usecols=None, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=_NoDefault.no_default, keep_date_col=False, date_parser=_NoDefault.no_default, date_format=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, encoding_errors='strict', dialect=None, on_bad_lines='error', delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None, dtype_backend=_NoDefault.no_default)[source]

Read general delimited file into DataFrame.

Also supports optionally iterating or breaking of the file into chunks.

Additional help can be found in the online docs for IO Tools.

Parameters:
  • filepath_or_buffer (str, path object or file-like object) –

    Any valid string path is acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.csv.

    If you want to pass in a path object, pandas accepts any os.PathLike.

    By file-like object, we refer to objects with a read() method, such as a file handle (e.g. via builtin open function) or StringIO.

  • sep (str, default '\t' (tab-stop)) – Delimiter to use. If sep is None, the C engine cannot automatically detect the separator, but the Python parsing engine can, meaning the latter will be used and automatically detect the separator by Python’s builtin sniffer tool, csv.Sniffer. In addition, separators longer than 1 character and different from '\s+' will be interpreted as regular expressions and will also force the use of the Python parsing engine. Note that regex delimiters are prone to ignoring quoted data. Regex example: '\r\t'.

  • delimiter (str, default None) – Alias for sep.

  • header (int, list of int, None, default 'infer') – Row number(s) to use as the column names, and the start of the data. Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0 and column names are inferred from the first line of the file, if column names are passed explicitly then the behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names. The header can be a list of integers that specify row locations for a multi-index on the columns e.g. [0,1,3]. Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the first line of data rather than the first line of the file.

  • names (array-like, optional) – List of column names to use. If the file contains a header row, then you should explicitly pass header=0 to override the column names. Duplicates in this list are not allowed.

  • index_col (int, str, sequence of int / str, or False, optional, default None) –

    Column(s) to use as the row labels of the DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex is used.

    Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you have a malformed file with delimiters at the end of each line.

  • usecols (list-like or callable, optional) –

    Return a subset of the columns. If list-like, all elements must either be positional (i.e. integer indices into the document columns) or strings that correspond to column names provided either by the user in names or inferred from the document header row(s). If names are given, the document header row(s) are not taken into account. For example, a valid list-like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data, usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.

    If callable, the callable function will be evaluated against the column names, returning names where the callable function evaluates to True. An example of a valid callable argument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using this parameter results in much faster parsing time and lower memory usage.

  • dtype (Type name or dict of column -> type, optional) –

    Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

    New in version 1.5.0: Support for defaultdict was added. Specify a defaultdict as input where the default determines the dtype of the columns which are not explicitly listed.

  • engine ({'c', 'python', 'pyarrow'}, optional) –

    Parser engine to use. The C and pyarrow engines are faster, while the python engine is currently more feature-complete. Multithreading is currently only supported by the pyarrow engine.

    New in version 1.4.0: The “pyarrow” engine was added as an experimental engine, and some features are unsupported, or may not work correctly, with this engine.

  • converters (dict, optional) – Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

  • true_values (list, optional) – Values to consider as True in addition to case-insensitive variants of “True”.

  • false_values (list, optional) – Values to consider as False in addition to case-insensitive variants of “False”.

  • skipinitialspace (bool, default False) – Skip spaces after delimiter.

  • skiprows (list-like, int or callable, optional) –

    Line numbers to skip (0-indexed) or number of lines to skip (int) at the start of the file.

    If callable, the callable function will be evaluated against the row indices, returning True if the row should be skipped and False otherwise. An example of a valid callable argument would be lambda x: x in [0, 2].

  • skipfooter (int, default 0) – Number of lines at bottom of file to skip (Unsupported with engine=’c’).

  • nrows (int, optional) – Number of rows of file to read. Useful for reading pieces of large files.

  • na_values (scalar, str, list-like, or dict, optional) – Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default the following values are interpreted as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’, ‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘None’, ‘n/a’, ‘nan’, ‘null’.

  • keep_default_na (bool, default True) –

    Whether or not to include the default NaN values when parsing the data. Depending on whether na_values is passed in, the behavior is as follows:

    • If keep_default_na is True, and na_values are specified, na_values is appended to the default NaN values used for parsing.

    • If keep_default_na is True, and na_values are not specified, only the default NaN values are used for parsing.

    • If keep_default_na is False, and na_values are specified, only the NaN values specified na_values are used for parsing.

    • If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.

    Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will be ignored.

  • na_filter (bool, default True) – Detect missing value markers (empty strings and the value of na_values). In data without any NAs, passing na_filter=False can improve the performance of reading a large file.

  • verbose (bool, default False) – Indicate number of NA values placed in non-numeric columns.

  • skip_blank_lines (bool, default True) – If True, skip over blank lines rather than interpreting as NaN values.

  • parse_dates (bool or list of int or names or list of lists or dict, default False) –

    The behavior is as follows:

    • boolean. If True -> try parsing the index.

    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

    If a column or index cannot be represented as an array of datetimes, say because of an unparsable value or a mixture of timezones, the column or index will be returned unaltered as an object data type. For non-standard datetime parsing, use pd.to_datetime after pd.read_csv.

    Note: A fast-path exists for iso8601-formatted dates.

  • infer_datetime_format (bool, default False) –

    If True and parse_dates is enabled, pandas will attempt to infer the format of the datetime strings in the columns, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by 5-10x.

    Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

  • keep_date_col (bool, default False) – If True and parse_dates specifies combining multiple columns then keep the original columns.

  • date_parser (function, optional) –

    Function to use for converting a sequence of string columns to an array of datetime instances. The default uses dateutil.parser.parser to do the conversion. Pandas will try to call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more strings (corresponding to the columns defined by parse_dates) as arguments.

    Deprecated since version 2.0.0: Use date_format instead, or read in as object and then apply to_datetime() as-needed.

  • date_format (str or dict of column -> format, default None) –

    If used in conjunction with parse_dates, will parse dates according to this format. For anything more complex, please read in as object and then apply to_datetime() as-needed.

    New in version 2.0.0.

  • dayfirst (bool, default False) – DD/MM format dates, international and European format.

  • cache_dates (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.

  • iterator (bool, default False) –

    Return TextFileReader object for iteration or getting chunks with get_chunk().

    Changed in version 1.2: TextFileReader is a context manager.

  • chunksize (int, optional) –

    Return TextFileReader object for iteration. See the IO Tools docs for more information on iterator and chunksize.

    Changed in version 1.2: TextFileReader is a context manager.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • thousands (str, optional) – Thousands separator.

  • decimal (str, default '.') – Character to recognize as decimal point (e.g. use ‘,’ for European data).

  • lineterminator (str (length 1), optional) – Character to break file into lines. Only valid with C parser.

  • quotechar (str (length 1), optional) – The character used to denote the start and end of a quoted item. Quoted items can include the delimiter and it will be ignored.

  • quoting (int or csv.QUOTE_* instance, default 0) – Control field quoting behavior per csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).

  • doublequote (bool, default True) – When quotechar is specified and quoting is not QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements INSIDE a field as a single quotechar element.

  • escapechar (str (length 1), optional) – One-character string used to escape other characters.

  • comment (str, optional) – Indicates remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by skiprows. For example, if comment='#', parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated as the header.

  • encoding (str, optional, default "utf-8") –

    Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of Python standard encodings.

    Changed in version 1.2: When encoding is None, errors="replace" is passed to open(). Otherwise, errors="strict" is passed to open(). This behavior was previously only the case for engine="python".

    Changed in version 1.3.0: encoding_errors is a new argument. encoding has no longer an influence on how encoding errors are handled.

  • encoding_errors (str, optional, default "strict") –

    How encoding errors are treated. List of possible values.

    New in version 1.3.0.

  • dialect (str or csv.Dialect, optional) – If provided, this parameter will override values (default or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace, quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.Dialect documentation for more details.

  • on_bad_lines ({'error', 'warn', 'skip'} or callable, default 'error') –

    Specifies what to do upon encountering a bad line (a line with too many fields). Allowed values are:

    • ’error’, raise an Exception when a bad line is encountered.

    • ’warn’, raise a warning when a bad line is encountered and skip that line.

    • ’skip’, skip bad lines without raising or warning when they are encountered.

    New in version 1.3.0.

    New in version 1.4.0:

    • callable, function with signature (bad_line: list[str]) -> list[str] | None that will process a single bad line. bad_line is a list of strings split by the sep. If the function returns None, the bad line will be ignored. If the function returns a new list of strings with more elements than expected, a ParserWarning will be emitted while dropping extra elements. Only supported when engine="python"

  • delim_whitespace (bool, default False) – Specifies whether or not whitespace (e.g. ' ' or '    ') will be used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be passed in for the delimiter parameter.

  • low_memory (bool, default True) – Internally process the file in chunks, resulting in lower memory use while parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize or iterator parameter to return the data in chunks. (Only valid with C parser).

  • memory_map (bool, default False) – If a filepath is provided for filepath_or_buffer, map the file object directly onto memory and access the data directly from there. Using this option can improve performance because there is no longer any I/O overhead.

  • float_precision (str, optional) –

    Specifies which converter the C engine should use for floating-point values. The options are None or ‘high’ for the ordinary converter, ‘legacy’ for the original lower precision pandas converter, and ‘round_trip’ for the round-trip converter.

    Changed in version 1.2.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow is used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

A delimited file is returned as a two-dimensional data structure with labeled axes.

Return type:

DataFrame or TextFileReader

See also

DataFrame.to_csv

Write DataFrame to a comma-separated values (csv) file.

read_csv

Read a comma-separated values (csv) file into DataFrame.

read_fwf

Read a table of fixed-width formatted lines into DataFrame.

Examples

>>> pd.read_table('data.csv')  
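
As a slightly fuller sketch with common options, where the file and column names are placeholders:

>>> df = pd.read_table('data.tsv', usecols=['foo', 'bar'],
...                    na_values=['n.a.'], nrows=100)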
pandas.read_xml(path_or_buffer, *, xpath='./*', namespaces=None, elems_only=False, attrs_only=False, names=None, dtype=None, converters=None, parse_dates=None, encoding='utf-8', parser='lxml', stylesheet=None, iterparse=None, compression='infer', storage_options=None, dtype_backend=_NoDefault.no_default)[source]

Read XML document into a DataFrame object.

New in version 1.3.0.

Parameters:
  • path_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a read() function. The string can be any valid XML string or a path. The string can further be a URL. Valid URL schemes include http, ftp, s3, and file.

  • xpath (str, optional, default './*') – The XPath to parse the required set of nodes for migration to a DataFrame. XPath should return a collection of elements and not a single element. Note: the etree parser supports limited XPath expressions. For more complex XPath, use lxml, which requires installation.

  • namespaces (dict, optional) –

    The namespaces defined in the XML document, given as a dict with the namespace prefix as key and the URI as value. There is no need to include all namespaces in the XML, only the ones used in the xpath expression. Note: if the XML document uses a default namespace denoted as xmlns=’<URI>’ without a prefix, you must assign any temporary namespace prefix such as ‘doc’ to the URI in order to parse underlying nodes and/or attributes. For example,

    namespaces = {"doc": "https://example.com"}
    

  • elems_only (bool, optional, default False) – Parse only the child elements at the specified xpath. By default, all child elements and non-empty text nodes are returned.

  • attrs_only (bool, optional, default False) – Parse only the attributes at the specified xpath. By default, all attributes are returned.

  • names (list-like, optional) – Column names for DataFrame of parsed XML data. Use this parameter to rename original element names and distinguish same named elements and attributes.

  • dtype (Type name or dict of column -> type, optional) –

    Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32, ‘c’: ‘Int64’}. Use str or object together with suitable na_values settings to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

    New in version 1.5.0.

  • converters (dict, optional) –

    Dict of functions for converting values in certain columns. Keys can either be integers or column labels.

    New in version 1.5.0.

  • parse_dates (bool or list of int or names or list of lists or dict, default False) –

    Identifiers to parse index or columns to datetime. The behavior is as follows:

    • boolean. If True -> try parsing the index.

    • list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.

    • list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.

    • dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’

    New in version 1.5.0.

  • encoding (str, optional, default 'utf-8') – Encoding of XML document.

  • parser ({'lxml','etree'}, default 'lxml') – Parser module to use for retrieval of data. Only ‘lxml’ and ‘etree’ are supported. With ‘lxml’ more complex XPath searches and ability to use XSLT stylesheet are supported.

  • stylesheet (str, path object or file-like object) – A URL, file-like object, or a raw string containing an XSLT script. This stylesheet should flatten complex, deeply nested XML documents for easier parsing. To use this feature you must have the lxml module installed and specify ‘lxml’ as the parser. The xpath must reference nodes of the transformed XML document generated after the XSLT transformation, not the original XML document. Only XSLT 1.0 scripts, not later versions, are currently supported.

  • iterparse (dict, optional) –

    The nodes or attributes to retrieve while iterparsing the XML document, given as a dict with the key being the name of the repeating element and the value being a list of element or attribute names that are descendants of the repeated element. Note: if this option is used, it will replace xpath parsing and, unlike xpath, descendants do not need to relate to each other but can exist anywhere in the document under the repeating element. This memory-efficient method should be used for very large XML files (500MB, 1GB, or 5GB+). For example,

    iterparse = {"row_element": ["child_elem", "attr", "grandchild_elem"]}
    

    New in version 1.5.0.

  • compression (str or dict, default 'infer') –

    For on-the-fly decompression of on-disk data. If ‘infer’ and ‘path_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). If using ‘zip’ or ‘tar’, the ZIP file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdDecompressor or tarfile.TarFile, respectively. As an example, the following could be passed for Zstandard decompression using a custom compression dictionary: compression={'method': 'zstd', 'dict_data': my_compression_dict}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow is used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

A DataFrame.

Return type:

DataFrame

See also

read_json

Convert a JSON string to pandas object.

read_html

Read HTML tables into a list of DataFrame objects.

Notes

This method is best designed to import shallow XML documents in the following format, which is the ideal fit for the two dimensions of a DataFrame (row by column).

<root>
    <row>
      <column1>data</column1>
      <column2>data</column2>
      <column3>data</column3>
      ...
   </row>
   <row>
      ...
   </row>
   ...
</root>

As a file format, XML documents can be designed in any way, including the layout of elements and attributes, as long as they conform to W3C specifications. Therefore, this method is a convenience handler for a specific flatter design, not all possible XML structures.

However, for more complex XML documents, stylesheet allows you to temporarily redesign original document with XSLT (a special purpose language) for a flatter version for migration to a DataFrame.

This function will always return a single DataFrame or raise exceptions due to issues with XML document, xpath, or other parameters.

See the read_xml documentation in the IO section of the docs for more information in using this method to parse XML files to DataFrames.

Examples

>>> xml = '''<?xml version='1.0' encoding='utf-8'?>
... <data xmlns="http://example.com">
...  <row>
...    <shape>square</shape>
...    <degrees>360</degrees>
...    <sides>4.0</sides>
...  </row>
...  <row>
...    <shape>circle</shape>
...    <degrees>360</degrees>
...    <sides/>
...  </row>
...  <row>
...    <shape>triangle</shape>
...    <degrees>180</degrees>
...    <sides>3.0</sides>
...  </row>
... </data>'''
>>> df = pd.read_xml(xml)
>>> df
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0
>>> xml = '''<?xml version='1.0' encoding='utf-8'?>
... <data>
...   <row shape="square" degrees="360" sides="4.0"/>
...   <row shape="circle" degrees="360"/>
...   <row shape="triangle" degrees="180" sides="3.0"/>
... </data>'''
>>> df = pd.read_xml(xml, xpath=".//row")
>>> df
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0
>>> xml = '''<?xml version='1.0' encoding='utf-8'?>
... <doc:data xmlns:doc="https://example.com">
...   <doc:row>
...     <doc:shape>square</doc:shape>
...     <doc:degrees>360</doc:degrees>
...     <doc:sides>4.0</doc:sides>
...   </doc:row>
...   <doc:row>
...     <doc:shape>circle</doc:shape>
...     <doc:degrees>360</doc:degrees>
...     <doc:sides/>
...   </doc:row>
...   <doc:row>
...     <doc:shape>triangle</doc:shape>
...     <doc:degrees>180</doc:degrees>
...     <doc:sides>3.0</doc:sides>
...   </doc:row>
... </doc:data>'''
>>> df = pd.read_xml(xml,
...                  xpath="//doc:row",
...                  namespaces={"doc": "https://example.com"})
>>> df
      shape  degrees  sides
0    square      360    4.0
1    circle      360    NaN
2  triangle      180    3.0
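
For very large documents, the iterparse parameter can be used in place of xpath; as a sketch, assuming a large file on disk whose repeating element is row (the file name is a placeholder):

>>> df = pd.read_xml("very_large.xml",
...                  iterparse={"row": ["shape", "degrees", "sides"]})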
pandas.set_eng_float_format(accuracy=3, use_eng_prefix=False)[source]

Format float representation in DataFrame with SI notation.

Parameters:
  • accuracy (int, default 3) – Number of decimal digits after the floating point.

  • use_eng_prefix (bool, default False) – Whether to represent a value with SI prefixes.

Return type:

None

Examples

>>> df = pd.DataFrame([1e-9, 1e-3, 1, 1e3, 1e6])
>>> df
              0
0  1.000000e-09
1  1.000000e-03
2  1.000000e+00
3  1.000000e+03
4  1.000000e+06
>>> pd.set_eng_float_format(accuracy=1)
>>> df
         0
0  1.0E-09
1  1.0E-03
2  1.0E+00
3  1.0E+03
4  1.0E+06
>>> pd.set_eng_float_format(use_eng_prefix=True)
>>> df
        0
0  1.000n
1  1.000m
2   1.000
3  1.000k
4  1.000M
>>> pd.set_eng_float_format(accuracy=1, use_eng_prefix=True)
>>> df
      0
0  1.0n
1  1.0m
2   1.0
3  1.0k
4  1.0M
>>> pd.set_option("display.float_format", None)  # unset option
pandas.show_versions(as_json=False)[source]

Provide useful information, important for bug reports.

It comprises information about the host operating system, the pandas version, and the versions of other installed related packages.

Parameters:

as_json (str or bool, default False) –

  • If False, outputs info in a human readable form to the console.

  • If str, it will be considered as a path to a file. Info will be written to that file in JSON format.

  • If True, outputs info in JSON format to the console.

Return type:

None
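
Examples

A sketch of typical use; the report is printed to the console and varies by machine, so output is omitted here:

>>> pd.show_versions()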

pandas.test(extra_args=None)[source]

Run the pandas test suite using pytest.

By default, runs with the marks --skip-slow, --skip-network, --skip-db.

Parameters:

extra_args (list[str], default None) – Extra marks to run the tests.

Return type:

None
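
Examples

A sketch of typical use; note that this actually runs the test suite, so it is slow and requires pytest and hypothesis to be installed (output omitted):

>>> pd.test()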

pandas.timedelta_range(start=None, end=None, periods=None, freq=None, name=None, closed=None, *, unit=None)[source]

Return a fixed frequency TimedeltaIndex with day as the default frequency.

Parameters:
  • start (str or timedelta-like, default None) – Left bound for generating timedeltas.

  • end (str or timedelta-like, default None) – Right bound for generating timedeltas.

  • periods (int, default None) – Number of periods to generate.

  • freq (str or DateOffset, default 'D') – Frequency strings can have multiples, e.g. ‘5H’.

  • name (str, default None) – Name of the resulting TimedeltaIndex.

  • closed (str, default None) – Make the interval closed with respect to the given frequency to the ‘left’, ‘right’, or both sides (None).

  • unit (str, default None) –

    Specify the desired resolution of the result.

    New in version 2.0.0.

Return type:

TimedeltaIndex

Notes

Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end (closed on both sides).

To learn more about the frequency strings, please see this link.

Examples

>>> pd.timedelta_range(start='1 day', periods=4)
TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq='D')

The closed parameter specifies which endpoint is included. The default behavior is to include both endpoints.

>>> pd.timedelta_range(start='1 day', periods=4, closed='right')
TimedeltaIndex(['2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq='D')

The freq parameter specifies the frequency of the TimedeltaIndex. Only fixed frequencies can be passed, non-fixed frequencies such as ‘M’ (month end) will raise.

>>> pd.timedelta_range(start='1 day', end='2 days', freq='6H')
TimedeltaIndex(['1 days 00:00:00', '1 days 06:00:00', '1 days 12:00:00',
                '1 days 18:00:00', '2 days 00:00:00'],
               dtype='timedelta64[ns]', freq='6H')

Specify start, end, and periods; the frequency is generated automatically (linearly spaced).

>>> pd.timedelta_range(start='1 day', end='5 days', periods=4)
TimedeltaIndex(['1 days 00:00:00', '2 days 08:00:00', '3 days 16:00:00',
                '5 days 00:00:00'],
               dtype='timedelta64[ns]', freq=None)

Specify a unit

>>> pd.timedelta_range("1 Day", periods=3, freq="100000D", unit="s")
TimedeltaIndex(['1 days 00:00:00', '100001 days 00:00:00',
                '200001 days 00:00:00'],
               dtype='timedelta64[s]', freq='100000D')
pandas.to_datetime(arg, errors='raise', dayfirst=False, yearfirst=False, utc=False, format=None, exact=_NoDefault.no_default, unit=None, infer_datetime_format=_NoDefault.no_default, origin='unix', cache=True)[source]

Convert argument to datetime.

This function converts a scalar, array-like, Series or DataFrame/dict-like to a pandas datetime object.

Parameters:
  • arg (int, float, str, datetime, list, tuple, 1-d array, Series, DataFrame/dict-like) – The object to convert to a datetime. If a DataFrame is provided, the method expects minimally the following columns: "year", "month", "day".

  • errors ({'ignore', 'raise', 'coerce'}, default 'raise') –

    • If 'raise', then invalid parsing will raise an exception.

    • If 'coerce', then invalid parsing will be set as NaT.

    • If 'ignore', then invalid parsing will return the input.

  • dayfirst (bool, default False) –

    Specify a date parse order if arg is str or is list-like. If True, parses dates with the day first, e.g. "10/11/12" is parsed as 2012-11-10.

    Warning

    dayfirst=True is not strict, but will prefer to parse with day first.

  • yearfirst (bool, default False) –

    Specify a date parse order if arg is str or is list-like.

    • If True parses dates with the year first, e.g. "10/11/12" is parsed as 2010-11-12.

    • If both dayfirst and yearfirst are True, yearfirst takes precedence (same as dateutil).

    Warning

    yearfirst=True is not strict, but will prefer to parse with year first.

  • utc (bool, default False) –

    Control timezone-related parsing, localization and conversion.

    • If True, the function always returns a timezone-aware UTC-localized Timestamp, Series or DatetimeIndex. To do this, timezone-naive inputs are localized as UTC, while timezone-aware inputs are converted to UTC.

    • If False (default), inputs will not be coerced to UTC. Timezone-naive inputs will remain naive, while timezone-aware ones will keep their time offsets. Limitations exist for mixed offsets (typically, daylight savings), see Examples section for details.

    See also: pandas general documentation about timezone conversion and localization.

  • format (str, default None) –

    The strftime to parse time, e.g. "%d/%m/%Y". See strftime documentation for more information on choices, though note that "%f" will parse all the way up to nanoseconds. You can also pass:

    • ”ISO8601”, to parse any ISO8601 time string (not necessarily in exactly the same format);

    • ”mixed”, to infer the format for each element individually. This is risky, and you should probably use it along with dayfirst.

  • exact (bool, default True) –

    Control how format is used:

    • If True, require an exact format match.

    • If False, allow the format to match anywhere in the target string.

    Cannot be used alongside format='ISO8601' or format='mixed'.

  • unit (str, default 'ns') – The unit of arg (D, s, ms, us, ns) when arg is an integer or float number. The computation is relative to origin. For example, with unit='ms' and origin='unix', this would calculate the number of milliseconds to the unix epoch start.

  • infer_datetime_format (bool, default False) –

    If True and no format is given, attempt to infer the format of the datetime strings based on the first non-NaN element, and if it can be inferred, switch to a faster method of parsing them. In some cases this can increase the parsing speed by ~5-10x.

    Deprecated since version 2.0.0: A strict version of this argument is now the default, passing it has no effect.

  • origin (scalar, default 'unix') –

    Define the reference date. The numeric values would be parsed as number of units (defined by unit) since this reference date.

    • If 'unix' (or POSIX) time; origin is set to 1970-01-01.

    • If 'julian', unit must be 'D', and origin is set to beginning of Julian Calendar. Julian day number 0 is assigned to the day starting at noon on January 1, 4713 BC.

    • If Timestamp convertible (Timestamp, dt.datetime, np.datetime64 or date string), origin is set to the Timestamp identified by origin.

    • If a float or integer, origin is the millisecond difference relative to 1970-01-01.

  • cache (bool, default True) – If True, use a cache of unique, converted dates to apply the datetime conversion. May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets. The cache is only used when there are at least 50 values. The presence of out-of-bounds values will render the cache unusable and may slow down parsing.

Returns:

If parsing succeeded. Return type depends on input (types in parentheses correspond to fallback in case of unsuccessful timezone or out-of-range timestamp parsing):

Return type:

datetime

Raises:
  • ParserError – When parsing a date from string fails.

  • ValueError – When another datetime conversion error happens. For example when one of ‘year’, ‘month’, ‘day’ columns is missing in a DataFrame, or when a timezone-aware datetime.datetime is found in an array-like of mixed time offsets, and utc=False.

See also

DataFrame.astype

Cast argument to a specified dtype.

to_timedelta

Convert argument to timedelta.

convert_dtypes

Convert dtypes.

Notes

Many input types are supported, and lead to different output types:

  • scalars can be int, float, str, datetime object (from stdlib datetime module or numpy). They are converted to Timestamp when possible, otherwise they are converted to datetime.datetime. None/NaN/null scalars are converted to NaT.

  • array-like can contain int, float, str, datetime objects. They are converted to DatetimeIndex when possible, otherwise they are converted to Index with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

  • Series are converted to Series with datetime64 dtype when possible, otherwise they are converted to Series with object dtype, containing datetime.datetime. None/NaN/null entries are converted to NaT in both cases.

  • DataFrame/dict-like are converted to Series with datetime64 dtype. For each row a datetime is created from assembling the various dataframe columns. Column keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’] or plurals of the same.

The following causes are responsible for datetime.datetime objects being returned (possibly inside an Index or a Series with object dtype) instead of a proper pandas designated type (Timestamp, DatetimeIndex or Series with datetime64 dtype):

  • when any input element is before Timestamp.min or after Timestamp.max, see timestamp limitations.

  • when utc=False (default) and the input is an array-like or Series containing mixed naive/aware datetime, or aware with mixed time offsets. Note that this happens in the (quite frequent) situation when the timezone has a daylight savings policy. In that case you may wish to use utc=True.

Examples

Handling various input formats

Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like [‘year’, ‘month’, ‘day’, ‘minute’, ‘second’, ‘ms’, ‘us’, ‘ns’] or plurals of the same.

>>> df = pd.DataFrame({'year': [2015, 2016],
...                    'month': [2, 3],
...                    'day': [4, 5]})
>>> pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]

Using a unix epoch time

>>> pd.to_datetime(1490195805, unit='s')
Timestamp('2017-03-22 15:16:45')
>>> pd.to_datetime(1490195805433502912, unit='ns')
Timestamp('2017-03-22 15:16:45.433502912')

Warning

For float arg, precision rounding might happen. To prevent unexpected behavior use a fixed-width exact type.

Using a non-unix epoch origin

>>> pd.to_datetime([1, 2, 3], unit='D',
...                origin=pd.Timestamp('1960-01-01'))
DatetimeIndex(['1960-01-02', '1960-01-03', '1960-01-04'],
              dtype='datetime64[ns]', freq=None)

Differences with strptime behavior

"%f" will parse all the way up to nanoseconds.

>>> pd.to_datetime('2018-10-26 12:00:00.0000000011',
...                format='%Y-%m-%d %H:%M:%S.%f')
Timestamp('2018-10-26 12:00:00.000000001')
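
The "ISO8601" option for format accepts ISO 8601 strings even when their layouts differ; a sketch:

>>> pd.to_datetime(['2020-01-01', '2020-01-01 03:00'], format='ISO8601')
DatetimeIndex(['2020-01-01 00:00:00', '2020-01-01 03:00:00'],
              dtype='datetime64[ns]', freq=None)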

Non-convertible date/times

If a date does not meet the timestamp limitations, passing errors='ignore' will return the original input instead of raising any exception.

Passing errors='coerce' will force an out-of-bounds date to NaT, in addition to forcing non-dates (or non-parseable dates) to NaT.

>>> pd.to_datetime('13000101', format='%Y%m%d', errors='ignore')
'13000101'
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT

Timezones and time offsets

The default behaviour (utc=False) is as follows:

  • Timezone-naive inputs are converted to timezone-naive DatetimeIndex:

>>> pd.to_datetime(['2018-10-26 12:00:00', '2018-10-26 13:00:15'])
DatetimeIndex(['2018-10-26 12:00:00', '2018-10-26 13:00:15'],
              dtype='datetime64[ns]', freq=None)
  • Timezone-aware inputs with constant time offset are converted to timezone-aware DatetimeIndex:

>>> pd.to_datetime(['2018-10-26 12:00 -0500', '2018-10-26 13:00 -0500'])
DatetimeIndex(['2018-10-26 12:00:00-05:00', '2018-10-26 13:00:00-05:00'],
              dtype='datetime64[ns, UTC-05:00]', freq=None)
  • However, timezone-aware inputs with mixed time offsets (for example issued from a timezone with daylight savings, such as Europe/Paris) are not successfully converted to a DatetimeIndex. Instead a simple Index containing datetime.datetime objects is returned:

>>> pd.to_datetime(['2020-10-25 02:00 +0200', '2020-10-25 04:00 +0100'])
Index([2020-10-25 02:00:00+02:00, 2020-10-25 04:00:00+01:00],
      dtype='object')
  • A mix of timezone-aware and timezone-naive inputs is also converted to a simple Index containing datetime.datetime objects:

>>> from datetime import datetime
>>> pd.to_datetime(["2020-01-01 01:00:00-01:00", datetime(2020, 1, 1, 3, 0)])
Index([2020-01-01 01:00:00-01:00, 2020-01-01 03:00:00], dtype='object')

Setting utc=True solves most of the above issues:

  • Timezone-naive inputs are localized as UTC

>>> pd.to_datetime(['2018-10-26 12:00', '2018-10-26 13:00'], utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2018-10-26 13:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
  • Timezone-aware inputs are converted to UTC (the output represents the exact same datetime, but viewed from the UTC time offset +00:00).

>>> pd.to_datetime(['2018-10-26 12:00 -0530', '2018-10-26 12:00 -0500'],
...                utc=True)
DatetimeIndex(['2018-10-26 17:30:00+00:00', '2018-10-26 17:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
  • Inputs can contain both string or datetime, the above rules still apply

>>> pd.to_datetime(['2018-10-26 12:00', datetime(2020, 1, 1, 18)], utc=True)
DatetimeIndex(['2018-10-26 12:00:00+00:00', '2020-01-01 18:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
pandas.to_numeric(arg, errors='raise', downcast=None, dtype_backend=_NoDefault.no_default)[source]

Convert argument to a numeric type.

The default return dtype is float64 or int64 depending on the data supplied. Use the downcast parameter to obtain other dtypes.

Please note that precision loss may occur if really large numbers are passed in. Due to the internal limitations of ndarray, if numbers smaller than -9223372036854775808 (np.iinfo(np.int64).min) or larger than 18446744073709551615 (np.iinfo(np.uint64).max) are passed in, it is very likely they will be converted to float so that they can be stored in an ndarray. These warnings apply similarly to Series since it internally leverages ndarray.

Parameters:
  • arg (scalar, list, tuple, 1-d array, or Series) – Argument to be converted.

  • errors ({'ignore', 'raise', 'coerce'}, default 'raise') –

    • If ‘raise’, then invalid parsing will raise an exception.

    • If ‘coerce’, then invalid parsing will be set as NaN.

    • If ‘ignore’, then invalid parsing will return the input.

  • downcast (str, default None) –

    Can be ‘integer’, ‘signed’, ‘unsigned’, or ‘float’. If not None, and if the data has been successfully cast to a numerical dtype (or if the data was numeric to begin with), downcast that resulting data to the smallest numerical dtype possible according to the following rules:

    • ’integer’ or ‘signed’: smallest signed int dtype (min.: np.int8)

    • ’unsigned’: smallest unsigned int dtype (min.: np.uint8)

    • ’float’: smallest float dtype (min.: np.float32)

    As this behaviour is separate from the core conversion to numeric values, any errors raised during the downcasting will be surfaced regardless of the value of the ‘errors’ input.

    In addition, downcasting will only occur if the size of the resulting data’s dtype is strictly larger than the dtype it is to be cast to, so if none of the dtypes checked satisfy that specification, no downcasting will be performed on the data.

  • dtype_backend ({"numpy_nullable", "pyarrow"}, defaults to NumPy backed DataFrames) –

    Which dtype_backend to use: when “numpy_nullable” is set, nullable dtypes are used for all dtypes that have a nullable implementation; when “pyarrow” is set, pyarrow is used for all dtypes.

    The dtype_backends are still experimental.

    New in version 2.0.

Returns:

Numeric if parsing succeeded. Return type depends on input. Series if Series, otherwise ndarray.

Return type:

ret

See also

DataFrame.astype

Cast argument to a specified dtype.

to_datetime

Convert argument to datetime.

to_timedelta

Convert argument to timedelta.

numpy.ndarray.astype

Cast a numpy array to a specified type.

DataFrame.convert_dtypes

Convert dtypes.

Examples

Take separate series and convert to numeric, coercing when told to

>>> s = pd.Series(['1.0', '2', -3])
>>> pd.to_numeric(s)
0    1.0
1    2.0
2   -3.0
dtype: float64
>>> pd.to_numeric(s, downcast='float')
0    1.0
1    2.0
2   -3.0
dtype: float32
>>> pd.to_numeric(s, downcast='signed')
0    1
1    2
2   -3
dtype: int8
>>> s = pd.Series(['apple', '1.0', '2', -3])
>>> pd.to_numeric(s, errors='ignore')
0    apple
1      1.0
2        2
3       -3
dtype: object
>>> pd.to_numeric(s, errors='coerce')
0    NaN
1    1.0
2    2.0
3   -3.0
dtype: float64

Downcasting of nullable integer and floating dtypes is supported:

>>> s = pd.Series([1, 2, 3], dtype="Int64")
>>> pd.to_numeric(s, downcast="integer")
0    1
1    2
2    3
dtype: Int8
>>> s = pd.Series([1.0, 2.1, 3.0], dtype="Float64")
>>> pd.to_numeric(s, downcast="float")
0    1.0
1    2.1
2    3.0
dtype: Float32
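
The dtype_backend argument selects a nullable implementation for the result; as a sketch, the expected output under the "numpy_nullable" backend:

>>> pd.to_numeric(pd.Series([1, 2, 3]), dtype_backend="numpy_nullable")
0    1
1    2
2    3
dtype: Int64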
pandas.to_pickle(obj, filepath_or_buffer, compression='infer', protocol=5, storage_options=None)[source]

Pickle (serialize) object to file.

Parameters:
  • obj (any object) – Any python object.

  • filepath_or_buffer (str, path object, or file-like object) – String, path object (implementing os.PathLike[str]), or file-like object implementing a binary write() function. Also accepts a URL, which has to be an S3 or GCS URL.

  • compression (str or dict, default 'infer') –

    For on-the-fly compression of the output data. If ‘infer’ and ‘filepath_or_buffer’ is path-like, then detect compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, ‘.xz’, ‘.zst’, ‘.tar’, ‘.tar.gz’, ‘.tar.xz’ or ‘.tar.bz2’ (otherwise no compression). Set to None for no compression. Can also be a dict with key 'method' set to one of {'zip', 'gzip', 'bz2', 'zstd', 'tar'} and other key-value pairs are forwarded to zipfile.ZipFile, gzip.GzipFile, bz2.BZ2File, zstandard.ZstdCompressor or tarfile.TarFile, respectively. As an example, the following could be passed for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip', 'compresslevel': 1, 'mtime': 1}.

    New in version 1.5.0: Added support for .tar files.

    Changed in version 1.4.0: Zstandard support.

  • protocol (int) – Indicates which protocol should be used by the pickler, default HIGHEST_PROTOCOL (see [1], paragraph 12.1.2). The possible values for this parameter depend on the version of Python. For Python 2.x, possible values are 0, 1, 2. For Python >= 3.0, 3 is a valid value. For Python >= 3.4, 4 is a valid value. A negative value for the protocol parameter is equivalent to setting its value to HIGHEST_PROTOCOL.

  • storage_options (dict, optional) –

    Extra options that make sense for a particular storage connection, e.g. host, port, username, password, etc. For HTTP(S) URLs the key-value pairs are forwarded to urllib.request.Request as header options. For other URLs (e.g. starting with “s3://”, and “gcs://”) the key-value pairs are forwarded to fsspec.open. Please see fsspec and urllib for more details, and for more examples on storage options refer here.

    New in version 1.2.0.

Return type:

None

See also

read_pickle

Load pickled pandas object (or any object) from file.

DataFrame.to_hdf

Write DataFrame to an HDF5 file.

DataFrame.to_sql

Write DataFrame to a SQL database.

DataFrame.to_parquet

Write a DataFrame to the binary parquet format.

Examples

>>> original_df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})  
>>> original_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
>>> pd.to_pickle(original_df, "./dummy.pkl")  
>>> unpickled_df = pd.read_pickle("./dummy.pkl")  
>>> unpickled_df  
   foo  bar
0    0    5
1    1    6
2    2    7
3    3    8
4    4    9
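
The dict form of compression described above can tune the compressor; a sketch reusing the frame from this example (the file name is again a placeholder):

>>> pd.to_pickle(original_df, "./dummy.pkl.gz",
...              compression={"method": "gzip", "compresslevel": 1})
>>> unpickled_df = pd.read_pickle("./dummy.pkl.gz")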
pandas.to_timedelta(arg, unit=None, errors='raise')[source]

Convert argument to timedelta.

Timedeltas are absolute differences in times, expressed in difference units (e.g. days, hours, minutes, seconds). This method converts an argument from a recognized timedelta format / value into a Timedelta type.

Parameters:
  • arg (str, timedelta, list-like or Series) –

    The data to be converted to timedelta.

    Changed in version 2.0: Strings with units ‘M’, ‘Y’ and ‘y’ do not represent unambiguous timedelta values and will raise an exception.

  • unit (str, optional) –

    Denotes the unit of the arg for numeric arg. Defaults to "ns".

    Possible values:

    • ’W’

    • ’D’ / ‘days’ / ‘day’

    • ’hours’ / ‘hour’ / ‘hr’ / ‘h’

    • ’m’ / ‘minute’ / ‘min’ / ‘minutes’ / ‘T’

    • ’S’ / ‘seconds’ / ‘sec’ / ‘second’

    • ’ms’ / ‘milliseconds’ / ‘millisecond’ / ‘milli’ / ‘millis’ / ‘L’

    • ’us’ / ‘microseconds’ / ‘microsecond’ / ‘micro’ / ‘micros’ / ‘U’

    • ’ns’ / ‘nanoseconds’ / ‘nano’ / ‘nanos’ / ‘nanosecond’ / ‘N’

    Changed in version 1.1.0: Must not be specified when arg contains strings and errors="raise".

  • errors ({'ignore', 'raise', 'coerce'}, default 'raise') –

    • If ‘raise’, then invalid parsing will raise an exception.

    • If ‘coerce’, then invalid parsing will be set as NaT.

    • If ‘ignore’, then invalid parsing will return the input.

Returns:

If parsing succeeded. Return type depends on input:

  • list-like: TimedeltaIndex of timedelta64 dtype

  • Series: Series of timedelta64 dtype

  • scalar: Timedelta

Return type:

timedelta

See also

DataFrame.astype

Cast argument to a specified dtype.

to_datetime

Convert argument to datetime.

convert_dtypes

Convert dtypes.

Notes

If the precision is higher than nanoseconds, the precision of the duration is truncated to nanoseconds for string inputs.

Examples

Parsing a single string to a Timedelta:

>>> pd.to_timedelta('1 days 06:05:01.00003')
Timedelta('1 days 06:05:01.000030')
>>> pd.to_timedelta('15.5us')
Timedelta('0 days 00:00:00.000015500')

Parsing a list or array of strings:

>>> pd.to_timedelta(['1 days 06:05:01.00003', '15.5us', 'nan'])
TimedeltaIndex(['1 days 06:05:01.000030', '0 days 00:00:00.000015500', NaT],
               dtype='timedelta64[ns]', freq=None)

Converting numbers by specifying the unit keyword argument:

>>> pd.to_timedelta(np.arange(5), unit='s')
TimedeltaIndex(['0 days 00:00:00', '0 days 00:00:01', '0 days 00:00:02',
                '0 days 00:00:03', '0 days 00:00:04'],
               dtype='timedelta64[ns]', freq=None)
>>> pd.to_timedelta(np.arange(5), unit='d')
TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq=None)
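
With errors='coerce', unparsable inputs become NaT instead of raising; a sketch:

>>> pd.to_timedelta(['1 days', 'foo'], errors='coerce')
TimedeltaIndex(['1 days', NaT], dtype='timedelta64[ns]', freq=None)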
pandas.unique(values)[source]

Return unique values based on a hash table.

Uniques are returned in order of appearance. This does NOT sort.

Significantly faster than numpy.unique for long enough sequences. Includes NA values.

Parameters:

values (1d array-like) –

Returns:

The return can be:

  • Index : when the input is an Index

  • Categorical : when the input is a Categorical dtype

  • ndarray : when the input is a Series/ndarray

Return numpy.ndarray or ExtensionArray.

Return type:

numpy.ndarray or ExtensionArray

See also

Index.unique

Return unique values from an Index.

Series.unique

Return unique values of Series object.

Examples

>>> pd.unique(pd.Series([2, 1, 3, 3]))
array([2, 1, 3])
>>> pd.unique(pd.Series([2] + [1] * 5))
array([2, 1])
>>> pd.unique(pd.Series([pd.Timestamp("20160101"), pd.Timestamp("20160101")]))
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.unique(
...     pd.Series(
...         [
...             pd.Timestamp("20160101", tz="US/Eastern"),
...             pd.Timestamp("20160101", tz="US/Eastern"),
...         ]
...     )
... )
<DatetimeArray>
['2016-01-01 00:00:00-05:00']
Length: 1, dtype: datetime64[ns, US/Eastern]
>>> pd.unique(
...     pd.Index(
...         [
...             pd.Timestamp("20160101", tz="US/Eastern"),
...             pd.Timestamp("20160101", tz="US/Eastern"),
...         ]
...     )
... )
DatetimeIndex(['2016-01-01 00:00:00-05:00'],
        dtype='datetime64[ns, US/Eastern]',
        freq=None)
>>> pd.unique(list("baabc"))
array(['b', 'a', 'c'], dtype=object)

An unordered Categorical will return categories in the order of appearance.

>>> pd.unique(pd.Series(pd.Categorical(list("baabc"))))
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']
>>> pd.unique(pd.Series(pd.Categorical(list("baabc"), categories=list("abc"))))
['b', 'a', 'c']
Categories (3, object): ['a', 'b', 'c']

An ordered Categorical preserves the category ordering.

>>> pd.unique(
...     pd.Series(
...         pd.Categorical(list("baabc"), categories=list("abc"), ordered=True)
...     )
... )
['b', 'a', 'c']
Categories (3, object): ['a' < 'b' < 'c']

An array of tuples:

>>> pd.unique([("a", "b"), ("b", "a"), ("a", "c"), ("b", "a")])
array([('a', 'b'), ('b', 'a'), ('a', 'c')], dtype=object)
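
As noted above, NA values are included in the result; a quick sketch (assuming the usual import numpy as np):

>>> pd.unique(np.array([1.0, np.nan, 1.0]))
array([ 1., nan])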
pandas.value_counts(values, sort=True, ascending=False, normalize=False, bins=None, dropna=True)[source]

Compute a histogram of the counts of non-null values.

Parameters:
  • values (ndarray (1-d)) –

  • sort (bool, default True) – Sort by values

  • ascending (bool, default False) – Sort in ascending order

  • normalize (bool, default False) – If True then compute a relative histogram

  • bins (integer, optional) – Rather than counting values, group them into half-open bins; a convenience for pd.cut that only works with numeric data

  • dropna (bool, default True) – Don’t include counts of NaN

Return type:

Series
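
Examples

A minimal sketch of typical usage (the exact formatting and the name of the returned Series can vary across pandas versions; np is the usual numpy import):

>>> import numpy as np
>>> pd.value_counts(np.array([1, 1, 2, 2, 2, 3]))
2    3
1    2
3    1
Name: count, dtype: int64
>>> pd.value_counts(np.array([1, 1, 2, 2, 2, 3]), normalize=True)
2    0.500000
1    0.333333
3    0.166667
Name: proportion, dtype: float64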

pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')[source]

Unpivot a DataFrame from wide to long format.

Less flexible but more user-friendly than melt.

With stubnames [‘A’, ‘B’], this function expects to find one or more groups of columns with the format A-suffix1, A-suffix2, …, B-suffix1, B-suffix2, … You specify what you want to call this suffix in the resulting long format with j (for example, j=’year’).

Each row of these wide variables is assumed to be uniquely identified by i (which can be a single column name or a list of column names).

All remaining variables in the data frame are left intact.

Parameters:
  • df (DataFrame) – The wide-format DataFrame.

  • stubnames (str or list-like) – The stub name(s). The wide format variables are assumed to start with the stub names.

  • i (str or list-like) – Column(s) to use as id variable(s).

  • j (str) – The name of the sub-observation variable. What you wish to name your suffix in the long format.

  • sep (str, default "") – A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format. For example, if your column names are A-suffix1, A-suffix2, you can strip the hyphen by specifying sep='-'.

  • suffix (str, default '\d+') – A regular expression capturing the wanted suffixes. '\d+' captures numeric suffixes. Suffixes with no numbers can be specified with the negated character class '\D+'. You can also further disambiguate suffixes: for example, if your wide variables are of the form A-one, B-two, …, and you have an unrelated column A-rating, you can exclude it by specifying suffix='(one|two)'. When all suffixes are numeric, they are cast to int64/float64.

Returns:

A DataFrame that contains each stub name as a variable, with new index (i, j).

Return type:

DataFrame

See also

melt

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

pivot

Return reshaped DataFrame organized by given index / column values.

DataFrame.pivot

Pivot without aggregation that can handle non-numeric data.

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DataFrame.unstack

Pivot based on the index values instead of a column.

Notes

All extra variables are left untouched. This simply uses pandas.melt under the hood, but is hard-coded to “do the right thing” in a typical case.

Examples

>>> np.random.seed(123)
>>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
...                    "A1980" : {0 : "d", 1 : "e", 2 : "f"},
...                    "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
...                    "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
...                    "X"     : dict(zip(range(3), np.random.randn(3)))
...                   })
>>> df["id"] = df.index
>>> df
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.085631   0
1     b     e    1.2    1.3  0.997345   1
2     c     f    0.7    0.1  0.282978   2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
                X  A    B
id year
0  1970 -1.085631  a  2.5
1  1970  0.997345  b  1.2
2  1970  0.282978  c  0.7
0  1980 -1.085631  d  3.2
1  1980  0.997345  e  1.3
2  1980  0.282978  f  0.1

With multiple id columns:

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht1': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht2': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age')
>>> l
                  ht
famid birth age
1     1     1    2.8
            2    3.4
      2     1    2.9
            2    3.8
      3     1    2.2
            2    2.9
2     1     1    2.0
            2    3.2
      2     1    1.8
            2    2.8
      3     1    1.9
            2    2.4
3     1     1    2.2
            2    3.3
      2     1    2.3
            2    3.4
      3     1    2.1
            2    2.9

Going from long back to wide just takes some creative use of unstack:

>>> w = l.unstack()
>>> w.columns = w.columns.map('{0[0]}{0[1]}'.format)
>>> w.reset_index()
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9

Less wieldy column names are also handled:

>>> np.random.seed(0)
>>> df = pd.DataFrame({'A(weekly)-2010': np.random.rand(3),
...                    'A(weekly)-2011': np.random.rand(3),
...                    'B(weekly)-2010': np.random.rand(3),
...                    'B(weekly)-2011': np.random.rand(3),
...                    'X' : np.random.randint(3, size=3)})
>>> df['id'] = df.index
>>> df 
   A(weekly)-2010  A(weekly)-2011  B(weekly)-2010  B(weekly)-2011  X  id
0        0.548814        0.544883        0.437587        0.383442  0   0
1        0.715189        0.423655        0.891773        0.791725  1   1
2        0.602763        0.645894        0.963663        0.528895  1   2
>>> pd.wide_to_long(df, ['A(weekly)', 'B(weekly)'], i='id',
...                 j='year', sep='-')
         X  A(weekly)  B(weekly)
id year
0  2010  0   0.548814   0.437587
1  2010  1   0.715189   0.891773
2  2010  1   0.602763   0.963663
0  2011  0   0.544883   0.383442
1  2011  1   0.423655   0.791725
2  2011  1   0.645894   0.528895

If we have many columns, we could also use a regex to find our stubnames and pass that list on to wide_to_long:

>>> stubnames = sorted(
...     set([match[0] for match in df.columns.str.findall(
...         r'[A-B]\(.*\)').values if match != []])
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']

All of the above examples have integers as suffixes. It is possible to have non-integers as suffixes.

>>> df = pd.DataFrame({
...     'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
...     'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
...     'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...     'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
   famid  birth  ht_one  ht_two
0      1      1     2.8     3.4
1      1      2     2.9     3.8
2      1      3     2.2     2.9
3      2      1     2.0     3.2
4      2      2     1.8     2.8
5      2      3     1.9     2.4
6      3      1     2.2     3.3
7      3      2     2.3     3.4
8      3      3     2.1     2.9
>>> l = pd.wide_to_long(df, stubnames='ht', i=['famid', 'birth'], j='age',
...                     sep='_', suffix=r'\w+')
>>> l
                  ht
famid birth age
1     1     one  2.8
            two  3.4
      2     one  2.9
            two  3.8
      3     one  2.2
            two  2.9
2     1     one  2.0
            two  3.2
      2     one  1.8
            two  2.8
      3     one  1.9
            two  2.4
3     1     one  2.2
            two  3.3
      2     one  2.3
            two  3.4
      3     one  2.1
            two  2.9
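
To make the suffix disambiguation described under the suffix parameter concrete, suppose an unrelated column shares a stub prefix; restricting the suffix pattern keeps it out of the reshape. A sketch with hypothetical column names (output shown approximately as pandas renders it):

>>> df = pd.DataFrame({
...     'id': [0, 1],
...     'A-one': [1, 2], 'A-two': [3, 4],
...     'B-one': [5, 6], 'B-two': [7, 8],
...     'A-rating': [9, 10]
... })
>>> pd.wide_to_long(df, ['A', 'B'], i='id', j='kind',
...                 sep='-', suffix='(one|two)')
         A-rating  A  B
id kind
0  one          9  1  5
1  one         10  2  6
0  two          9  3  7
1  two         10  4  8

The A-rating column does not match the '(one|two)' suffix pattern, so it is carried through untouched like any other remaining variable.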